Methods and systems for analyzing complex biological systems

ABSTRACT

The present invention provides methods and systems for organizing complex and disparate data. More specifically, the present invention provides methods and systems for organizing complex and disparate data into coherent data sets. Coherent data sets resulting from the methods and systems of the present invention serve as models for biological systems. Methods and systems for integrating data and creating coherent data sets are useful for numerous biological applications, such as, for example, determining gene function, identifying and validating drug and pesticide targets, identifying and validating drug and pesticide candidate compounds, profiling drug and pesticide compounds, producing a compilation of health or wellness profiles, determining compound site(s) of action, identifying unknown samples, and numerous other applications in the agricultural, pharmaceutical, forensic, and biotechnology industries.

RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Application No. 60/414,488, filed Sep. 27, 2002; U.S. Provisional Application No. 60/408,721, filed Sep. 6, 2002; U.S. Provisional Application No. 60/407,840, filed Sep. 3, 2002; U.S. Provisional Application No. 60/404,233, filed Aug. 16, 2002; U.S. Provisional Application No. 60/384,445, filed May 30, 2002; U.S. Provisional Application No. 60/379,562, filed May 10, 2002; U.S. Provisional Application No. 60/374,229, filed Apr. 19, 2002; U.S. Provisional Application No. 60/372,679, filed Apr. 15, 2002; U.S. Provisional Application No. 60/368,776, filed Mar. 29, 2002; U.S. Provisional Application No. 60/363,685, filed Mar. 12, 2002; U.S. Provisional Application No. 60/356,994, filed Feb. 14, 2002; U.S. Provisional Application No. 60/344,953, filed Dec. 21, 2001; and U.S. Provisional Application No. 60/331,948, filed Nov. 21, 2001. All of the foregoing patent applications are incorporated in their entirety by reference.

[0002] The present application is related to U.S. application Ser. No. ______, filed Nov. 20, 2002, titled “Methods and Systems for Analyzing Complex Biological Systems” (Attorney Docket Number 2114US1); U.S. application Ser. No. ______, filed Nov. 20, 2002, titled “Methods and Systems for Analyzing Complex Biological Systems” (Attorney Docket Number 2114US2); U.S. application Ser. No. ______, filed Nov. 20, 2002, titled “Methods and Systems for Analyzing Complex Biological Systems” (Attorney Docket Number 2114US3); U.S. application Ser. No. ______, filed Nov. 20, 2002, titled “Methods and Systems for Analyzing Complex Biological Systems” (Attorney Docket Number 2114US4); U.S. application Ser. No. ______, filed Nov. 20, 2002, titled “Methods and Systems for Analyzing Complex Biological Systems” (Attorney Docket Number 2114US6); U.S. application Ser. No. ______, filed Nov. 20, 2002, titled “Methods and Systems for Analyzing Complex Biological Systems” (Attorney Docket Number 2114US7); U.S. application Ser. No. ______, filed Nov. 20, 2002, titled “Methods and Systems for Analyzing Complex Biological Systems” (Attorney Docket Number 2114US8); U.S. application Ser. No. ______, filed Nov. 20, 2002, titled “Methods and Systems for Analyzing Complex Biological Systems” (Attorney Docket Number 2114US9); and U.S. application Ser. No. ______, filed Nov. 20, 2002, titled “Methods and Systems for Analyzing Complex Biological Systems” (Attorney Docket Number 2114US10).

FIELD OF THE INVENTION

[0003] The present invention provides a method for organizing complex and disparate biological data into a single, logical data set. Specifically, the method of the present invention pertains to the creation of a common data currency for integrating and analyzing large quantities of heterogeneous data. The invention is useful in multiple applications, including applications in the agricultural, pharmaceutical, forensic, and nutriceutical industries.

BACKGROUND OF THE INVENTION

[0004] The application of genomics to life science industries promises to change the way pharmaceutical, agricultural, and biotechnology companies operate, saving significant amounts of time and money in the development of new and efficacious products. The original core concept of genomics research was that obtaining the genomic sequence of an organism would lead directly to identification of every gene in the organism and an unambiguous determination of the function of each identified gene. This promise rests on two basic assumptions. First, a basic paradigm of molecular biology is that each gene encodes one protein having one function. Second, it is assumed that by performing homology-based sequence comparisons, scientists can identify the function of most genes based on the sequence information available from public databases. Unfortunately, both of these assumptions have faults and, as a result, the genomics era has yet to provide an accelerated route from gene discovery to blockbuster product. An additional complicating factor in the study of biological systems is that protein function is often defined in the context of a given situation, i.e., through interactions with other proteins and within specific cellular and subcellular compartments.

[0005] The assumption of a linear relationship between gene and function is now being recognized as overly simplistic, at best. A “cause-and-effect” relationship between a single gene, its product, and a phenotype (or disease state) is the exception, not the rule. Some highly successful biopharmaceutical products, including insulin and erythropoietin, operate through their ability to modulate such linear relationships. However, problems such as ligand redundancies and cell-type specificities hinder the development of a pharmaceutical or agricultural product. To further complicate matters, many systems operate through nonlinear dose dependencies. In other words, at one concentration a compound may have one effect (such as an anti-inflammatory effect), while at a different concentration in the same cell type the compound may have an opposite effect (such as a pro-inflammatory effect). Issues of ligand redundancy, cell-type specificity, and nonlinear dose dependency are difficult to reconcile in a product development environment, even in cases where gene function is known or predictable. Moreover, many diseases are polygenic, so not only do multiple gene products require identification, but alternate treatment compounds are likely required to address the role each gene product plays in a disease process. M. Khodadoust & T. Klein, 19 NATURE BIOTECH. 707 (2001).

[0006] For years it was assumed that gene function was determinable by obtaining a gene sequence and performing a homology-based comparison. The central dogma is that similar sequence equals similar structure, which equals similar function. Gene annotations found in public databases are far from infallible, and overreliance on them may misdirect research efforts. In many cases, only a very small percentage of any given genome is actually experimentally annotated. Homology sequence comparisons and blanket application of the central dogma supply the remaining annotation. While amino acid identity greater than 40 percent between two complete protein sequences implies structural similarity, it does not necessarily imply functional similarity. Additional sequence conservation in an active site region is required for accurate prediction of function. Wilson et al., 297 J. MOL. BIOL. 233-249 (2000). Proteins are typically organized into families based on the similarity of three-dimensional structures. In some cases, members of the same protein family may have no detectable sequence similarity, illustrating that structural similarities do not necessarily imply sequence similarities, and vice versa. Current annotation available from public sources is largely incomplete, and as a result, sequence comparison is not a viable approach to determining the relative roles of genes sequenced in genomics projects.

[0007] To meet the challenge of understanding complex biological systems, scientists require the ability to analyze complex data sets. As noted above, the sequencing of entire genomes has not led to an industry pipeline bulging with new life sciences products, nor has it led to an understanding of the function of all the sequenced genes. Currently, less than 5 percent of genes with annotation available from a public database are sufficiently well annotated for the information to be used directly in the development of products. As a result, a number of research technologies, such as gene expression profiling, metabolite analysis, phenotypic profiling, proteomics, 3-D protein structural analysis, protein expression, identification of biochemical pathways or networks, genotyping (including polymorphisms) and scientific literature tools are under development to help identify gene function. Each technology has its strengths and weaknesses and no single existing technology is sufficient to identify the function of all genes.

[0008] Since no single technology is the answer to gene function identification, the challenge is to combine data from different technology types in resultant data sets that are meaningful. Unfortunately, combining data from various sources is fraught with substantial technical problems in data organization and data analysis. Research technology systems organize data in different ways. Different research technologies use different analysis tools, which ask conceptually different questions. Analysis tools used in association with different technologies can provide dissimilar and even contradictory conclusions with respect to gene function and other data end points. It seems likely that for the majority of genes, the identification of function will only become possible if data from a variety of sources and technologies are organized as a single, logical data set. That is, the potential of multi-technology genomic research has not yet been realized because there is no common currency for integration and analysis of large quantities of heterogeneous data. Thus, there exists a need for the development of a meaningful way to produce and analyze multi-technology-derived data to provide scientists with yet untapped knowledge to aid in the development of new and efficacious agricultural, pharmaceutical, forensic, and nutriceutical products.

SUMMARY OF THE INVENTION

[0009] The present invention provides methods and systems for organizing complex and disparate data into coherent data sets. Coherent data sets serve as models for biological systems under examination. Methods and systems for integrating data and creating coherent data sets are useful for numerous biological applications, such as, for example, determining gene function, identifying and validating drug and pesticide targets, identifying and validating drug and pesticide candidate compounds, profiling of drug and pesticide compounds, producing a compilation of health or wellness profiles, determining compound site(s) of action, identifying unknown samples, and numerous other applications in the agricultural, pharmaceutical, forensic, and biotechnology industries.

[0010] The invention provides methods and systems for creating coherent data sets for modeling biological systems, wherein the methods include entering a unique identifier of a biological sample into a computer tracking system, and storing data in the computer tracking system, wherein the data are linked to the unique identifier. All linked data are converted to a numeric format, and the numeric data are converted to a common unit system, wherein the common unit system data are a coherent data set and can serve as a model for a biological system. The methods and systems of the invention are not limited in terms of the order in which the data are linked to the identifier or converted to numeric and common unit system format. For example, in an alternative embodiment of the invention, numeric format data or common unit system data are collected; the data are linked to a unique identifier; and the data are stored in the computer tracking system.
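The sequence of steps described in the preceding paragraph (link to a unique identifier, convert to numeric format, convert to a common unit system) can be illustrated with a minimal sketch. Everything named here is a hypothetical stand-in chosen for illustration: the `TrackingSystem` class, the `NECROSIS_SCALE` ordinal scale, the data type names, and the choice of "standard deviations from a baseline mean" as the common unit system (one option the definitions of this document mention). It is not the implementation of the invention.

```python
from dataclasses import dataclass, field
from statistics import mean, stdev

# Hypothetical ordinal scale for turning a non-numeric phenotype into a number.
NECROSIS_SCALE = {"none": 0.0, "mild": 1.0, "moderate": 2.0, "severe": 3.0}


@dataclass
class TrackingSystem:
    """Toy stand-in for the computer tracking system: data keyed by sample ID."""
    records: dict = field(default_factory=dict)

    def store(self, sample_id, data_type, value):
        # Every stored value is linked to the sample's unique identifier.
        self.records.setdefault(sample_id, {})[data_type] = value


def to_numeric(data_type, value):
    """Convert a raw observation to numeric format."""
    if data_type == "leaf_necrosis":
        return NECROSIS_SCALE[value]
    return float(value)


def to_common_units(value, baseline_values):
    """Express a numeric value as standard deviations from the baseline mean."""
    return (value - mean(baseline_values)) / stdev(baseline_values)


def coherent_data_set(tracker, baselines):
    """Apply both conversions to all data linked to each sample identifier."""
    result = {}
    for sample_id, data in tracker.records.items():
        result[sample_id] = {}
        for data_type, raw in data.items():
            baseline = [to_numeric(data_type, b) for b in baselines[data_type]]
            result[sample_id][data_type] = to_common_units(
                to_numeric(data_type, raw), baseline
            )
    return result
```

In this sketch a transcript measurement and a phenotype score for the same sample end up side by side in the same unit, which is what allows them to be compared directly.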

[0011] In one embodiment, the invention provides a method and a system for creating coherent data sets for modeling biological systems, wherein the method includes entering a unique identifier of a biological sample into a computer tracking system, and storing in the computer tracking system disparate data, wherein the disparate data comprise at least two types of data, and the disparate data are linked to the unique identifier. The linked disparate data are converted to a numeric format, and the numeric data are converted to a common unit system, wherein the common unit system data are a coherent data set and can serve as a model for a biological system.

[0012] In another embodiment, the invention provides a method and a system for creating coherent data sets for modeling biological systems, wherein the method includes entering a unique identifier of a biological sample into a computer tracking system, and storing in the computer tracking system disparate data, wherein the disparate data comprise at least three types of data, and the disparate data are linked to the unique identifier. The linked disparate data are converted to a numeric format, and the numeric data are converted to a common unit system, wherein the common unit system data are a coherent data set and can serve as a model for a biological system.

[0013] In yet another embodiment, the invention provides a method and a system for establishing a signature profile indicative of the physiological status of an individual, wherein the method includes entering a unique identifier of at least one biological sample into a computer tracking system and storing in the computer tracking system data, wherein the data are linked to the unique identifier. The linked data are converted to a numeric format, and the numeric data are converted to a common unit system, wherein the common unit system data are a coherent data set. The most informative of the common unit system data are determined, wherein the most informative data are a signature profile indicative of physiological status.

[0014] In still another embodiment, the invention provides a method and a system for examining chemical components in biological samples, comprising entering a unique identifier of at least one biological sample into a computer tracking system and simultaneously collecting data from the sample, for a plurality of peaks, each peak comprising at least one chemical component, wherein the data comprise data from at least two processes. The data from the sample are stored in the computer tracking system, wherein the data are linked to the unique identifier, and the chemical components are characterized and/or identified.

[0015] In another embodiment, the invention provides a method and a system for examining chemical components in biological samples, comprising entering a unique identifier of at least one biological sample into a computer tracking system and simultaneously collecting data from the sample, for a plurality of peaks, each peak comprising at least one chemical component, wherein the data comprise data from at least three processes. The data from the sample are stored in the computer tracking system, wherein the data are linked to the unique identifier, and the chemical components are characterized and/or identified.

[0016] In yet another embodiment, the invention provides a method and a system for examining metabolites in biological samples, comprising entering a unique identifier of at least one biological sample into a computer tracking system and simultaneously collecting data from the sample, for a plurality of peaks, each peak comprising at least one chemical component. The data from the sample are stored in the computer tracking system, wherein the data are linked to the unique identifier, and the chemical components are characterized and/or identified. The characterized and/or identified chemical components are linked to metabolites in biochemical pathways.

[0017] In still another embodiment, the invention provides a method and a system for establishing a signature profile indicative of the physiological status of an individual, comprising entering a unique identifier of at least one biological sample into a computer tracking system, and collecting and storing in the computer tracking system metabolite data, wherein the data are linked to the unique identifier. The linked data are compared to a reference, and the most informative of the compared data are determined, wherein the most informative data are a signature profile indicative of physiological status.
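The comparison-to-reference step of the preceding embodiment can be sketched as follows. This section does not define how the "most informative" data are determined, so the criterion used here (largest absolute deviation from the reference, in reference standard deviations) is purely an assumption made for illustration, as are the function name `signature_profile`, the `top_k` parameter, and the metabolite labels.

```python
from statistics import mean, stdev


def signature_profile(sample, reference, top_k=3):
    """Rank metabolites by how far the sample departs from the reference.

    sample:    {metabolite name: measured value}
    reference: {metabolite name: list of reference values}
    ASSUMPTION: "most informative" is taken to mean the largest absolute
    deviation, measured in standard deviations of the reference values.
    """
    deviations = {}
    for name, value in sample.items():
        ref = reference[name]
        deviations[name] = (value - mean(ref)) / stdev(ref)
    # Sort by magnitude of deviation, largest first, and keep the top entries.
    ranked = sorted(deviations.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]
```

Other selection criteria (for example, the variables that best discriminate between known groups) would fit the same interface; the ranking rule is the only part that would change.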

BRIEF DESCRIPTION OF THE FIGURES

[0018] FIG. 1 depicts various indicators that can be examined to determine the biological status of an individual.

[0019] FIG. 2 is a representation of the parallel nature of the pharmaceutical and agrochemical product discovery and development processes.

[0020] FIG. 3 is a diagram representing the construction of an endogenous metabolite database.

[0021] FIG. 4 is a schematic diagram illustrating an example of integrated data. In the example, gene expression was experimentally altered for a particular gene identified as Gene_ID. The unique gene identifier, Gene_ID, is linked in a computer tracking system to the gene annotation, the relative amount of gene substrates/products, the relative amount of gene transcript, and the phenotype of the organism in which the gene was altered.

[0022] FIG. 5 is a schematic diagram illustrating FUNCTIONFINDER technology, comprising four interrelated components: databases, data processing, data analysis tools, and user interfaces.

[0023] FIG. 6 is a graphical depiction of the results of a cluster analysis performed on phenotypic data corresponding to plants in which the expression of a particular gene was knocked out using antisense technology. The x-axis of the graph represents the particular gene identifier and the y-axis is the maximum distance between clusters.

[0024] FIG. 7 is a graphical depiction illustrating the relative response of a multitude of compounds in a biological sample relative to a baseline. Each compound is represented on the y-axis and is plotted as number of standard deviations from the baseline on the x-axis. For example, compound 700, sinapinic acid, is present in the sample at a response that is slightly less than 2 standard deviations above that of the baseline. Compound 702, hydroxyphenol pyruvic acid, is present at a response that is slightly more than 2 standard deviations below that of the baseline.

[0025] FIGS. 8A-8C are a visualization of principal components analysis of phenotypic, gene expression, and metabolite data collected for Arabidopsis plants treated with the eighteen different herbicides in Table 3. The data were normalized to a baseline prior to the analysis. Each of the nine herbicide site-of-action groups is represented by a separate symbol. FIG. 8A) Gene expression data (y-axis) and metabolite data (x-axis). FIG. 8B) Phenotypic data (y-axis) and gene expression data (x-axis). FIG. 8C) Phenotypic data (y-axis) and metabolite data (x-axis). None of the pairwise analyses resulted in accurate grouping of the herbicides by site/mode of action.

[0026] FIGS. 9A-9B are two different views of a 3-dimensional graphical depiction of 3 types of hypothetical data. The figure was generated to demonstrate that interpretation of data may change depending on the particular view. For example, at an axis rotation of 50° horizontal and 20° vertical (FIG. 9A) two separate clusters are observable, while at an axis rotation of 95° horizontal and 15° vertical (FIG. 9B) three separate clusters are visible.

[0027] FIG. 10 is a diagram illustrating one example of the creation and use of a coherent data set, in which hypotheses are formed and tested by laboratory experiments.

[0028] FIGS. 11A-11B are three-dimensional plots of mass spectral electrospray ionization chromatograms (LC-MS-ESI) of mouse tissue samples showing retention time, compound number and relative response. The left side of the plots (left of 0.0) depicts the positive mode chromatograms and the right side depicts the negative mode chromatograms. FIG. 11A) Mouse heart tissue. FIG. 11B) Mouse kidney tissue.

[0029] FIGS. 12A-12G are images depicting the phenotypes of three-week-old Arabidopsis plants treated with a herbicide representative of each of the six symptom classes listed in Table 3. Herbicides were applied in either 15% DMSO or 20% tetrahydrofurfuryl alcohol. The negative control contained a corresponding solution lacking herbicide. Plants treated with the herbicides displayed six separate phenotypes depicted in panels B-G. FIG. 12A) Phenotype representative of negative control plants. FIG. 12B) Phenotype representative of Amitrole treated plants. FIG. 12C) Phenotype representative of Glufosinate treated plants. FIG. 12D) Phenotype representative of Glyphosate; Imazapyr; Imazethapyr; and Chlorsulfuron treated plants. FIG. 12E) Phenotype representative of 2,4-D; Dicamba; and Benazolin treated plants. FIG. 12F) Phenotype representative of Acifluorfen and Bifenox treated plants. FIG. 12G) Phenotype representative of Atrazine; Metribuzin; Diuron; Bentazon; Paraquat; Diquat and Metolachlor treated plants.

[0030] FIGS. 13A-13F are graphical representations of the results of cluster analysis of gene expression and biochemical profile data collected for Arabidopsis plants treated with the 18 herbicides listed in Table 3. Gene expression and biochemical profiles were derived by calculating the average response for the control treatments and standardizing the average test responses to the respective control averages in units of standard deviations. FIG. 13A) Gene expression profile data collected at early time point. FIG. 13B) Gene expression profile data collected at middle time point. FIG. 13C) Gene expression profile data collected at late time point. FIG. 13D) Biochemical profile data collected at early time point. FIG. 13E) Biochemical profile data collected at middle time point. FIG. 13F) Biochemical profile data collected at late time point. The biochemical and gene expression profile data were clustered using SAS PROC CLUSTER, and SAS PROC TREE was used to produce the dendrograms. The nine herbicide groups according to site of action are represented as follows: ◯=Glyphosate; □=Glufosinate; ▴=Acifluorfen and Bifenox; ▾=Imazapyr, Imazethapyr, and Chlorsulfuron; =Atrazine, Metribuzin, Diuron, and Bentazon; ⋄=Paraquat and Diquat; ▪=2,4-D; Dicamba and Benazolin; ♡=Amitrole; and ♦=Metolachlor.

[0031] FIG. 14 is a three-dimensional graphical representation of a coherent data set where the first principal component of each of the phenotypic data, the biochemical profile data and the gene expression profile data is represented on the y-axis, z-axis and x-axis, respectively. The plot was made using Spotfire DECISIONSITE. Principal components analysis was performed separately on the phenotypic, biochemical, and gene expression profile data, using SAS PROC PRINCOMP. The principal components were used to derive a linear discriminant rule using SAS PROC DISCRIM with equal priors. The rule indicated 100% correct classification of the herbicides by site of action (SOA). The nine herbicide groups according to site of action are represented as follows: =Glyphosate; =Glufosinate; =Acifluorfen and Bifenox; ♦=Imazapyr, Imazethapyr, and Chlorsulfuron; ♡=Atrazine, Metribuzin, Diuron, and Bentazon; =Paraquat and Diquat; =2,4-D; Dicamba and Benazolin; ▪=Amitrole; and =Metolachlor.

[0032] FIGS. 15A-15L display the phenotype of Arabidopsis plants treated with five different compounds (Unknown 1 to Unknown 5) suspended in two different spray formulations, THFA and Tween 80. The images were taken five days after treatment. FIG. 15A) Negative control treated with THFA alone. FIG. 15B) Treated with Unknown 1 in THFA. FIG. 15C) Treated with Unknown 2 in THFA. FIG. 15D) Treated with Unknown 3 in THFA. FIG. 15E) Treated with Unknown 4 in THFA. FIG. 15F) Treated with Unknown 5 in THFA. FIG. 15G) Negative control treated with Tween 80 alone. FIG. 15H) Treated with Unknown 1 in Tween 80. FIG. 15I) Treated with Unknown 2 in Tween 80. FIG. 15J) Treated with Unknown 3 in Tween 80. FIG. 15K) Treated with Unknown 4 in Tween 80. FIG. 15L) Treated with Unknown 5 in Tween 80.

[0033] FIG. 16 is a graphical representation of the hierarchical clustering of gene expression data from Arabidopsis plants treated with five unknown compounds (Unknown 1 to Unknown 5) and five commercially available herbicides. Data were derived from tissue harvested one hour following treatment. The name of the treatment (x-axis) is plotted versus the semi-partial R-squared value (y-axis).

[0034] FIG. 17 is a graphical representation of the hierarchical clustering of gene expression data, metabolite data, and phenotypic data from Arabidopsis plants treated with five unknown compounds (Unknown 1 to Unknown 5) and five commercially available herbicides. Data were derived from tissue harvested one hour following treatment. The name of the treatment (x-axis) is plotted versus the semi-partial R-squared value (y-axis).

[0035] FIGS. 18A-18D are schematic diagrams of the chemical structures of the antifungal drugs as follows: FIG. 18A) Amphotericin B; FIG. 18B) Fluconazole; FIG. 18C) Ketoconazole; and FIG. 18D) Posaconazole.

[0036] FIG. 19 illustrates the mapping of genes to pathways based on data obtained from experiment AF1, which examined the effects of the antifungal drugs Amphotericin B, Ketoconazole, Fluconazole, and Posaconazole on yeast cells. Yeast gene accession numbers were parsed from KEGG pathway files resulting in the mapping of 1145 genes to 103 pathways. The percentage of genes (y-axis) is plotted versus the number of pathways (x-axis).

[0037] FIG. 20 illustrates the mapping of compounds to pathways based on data obtained from experiment AF1, which examined the effects of the antifungal drugs Amphotericin B, Ketoconazole, Fluconazole, and Posaconazole on yeast cells. The percentage of compounds (y-axis) is plotted versus the number of pathways (x-axis). By linking through enzymes, 676 compounds were linked to 92 separate pathways. The 77 compounds detected in the experiment were mapped to 69 separate pathways.

[0038] FIGS. 21A-21D depict the pathway score attributed to gene expression data derived from yeast cells treated with the antifungal compounds Amphotericin B, Ketoconazole, Fluconazole, and Posaconazole in the AF1 study. The yeast genes most perturbed in the treated cells were linked to KEGG pathways (y-axis) and assigned a pathway score (x-axis) according to Equation 1. FIG. 21A) Amphotericin B; FIG. 21B) Fluconazole; FIG. 21C) Ketoconazole; and FIG. 21D) Posaconazole.

[0039] FIG. 22 is an illustration of the result obtained when the principal components (gene expression analysis and metabolite analysis) of the AF1 study are subjected to clustering analysis. The name of the treatment (x-axis) is plotted versus the semi-partial R-squared value (y-axis).

[0040] FIG. 23 is an illustration of the ergosterol biochemical pathway, showing where the azole drugs examined in the AF1 study have their effect.

DETAILED DESCRIPTION

[0041] For clarity and consistency, the following definitions will be used throughout this patent document. To the extent that the following definitions conflict with other definitions for the defined terms, the following definitions shall control.

[0042] “Agriculture” or “agricultural,” as used in this document, refers to the science, art, or practice of cultivating the soil, producing crops, and raising livestock and in varying degrees the preparation and marketing of the resulting products. Thus, development of agricultural products includes development of pesticides against organisms harmful to crops and/or livestock, as well as development of products to enhance the health and market value of livestock and crops, such as improved agronomic traits in crop plants.

[0043] Identifying a “baseline” value is an essential element of biological experimentation and provides, among other things, a mechanism for distinguishing experimental error from biological variation. A baseline is used in the invention to standardize data to a common or commonly relevant unit of measure. The term “baseline” is used herein interchangeably with “reference” and “control.” Baseline populations consist, for example, of data from organisms of a particular group, such as healthy or normal organisms, or organisms diagnosed as having a particular disease state, pathophysiological condition, or other physiological state of interest. An example of the use of a baseline is the expression of data measurements as standard deviations from the corresponding baseline mean.
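The standardization mentioned at the end of the preceding paragraph can be written out explicitly. The symbols below are introduced only for this illustration and do not appear in the original text: x is a measurement, b_1 ... b_n are the corresponding baseline measurements, and z is the standardized value. The use of the sample (n - 1) standard deviation is an assumption.

```latex
\bar{b} = \frac{1}{n}\sum_{k=1}^{n} b_k, \qquad
s_b = \sqrt{\frac{1}{n-1}\sum_{k=1}^{n}\left(b_k - \bar{b}\right)^2}, \qquad
z = \frac{x - \bar{b}}{s_b}
```

As a hypothetical worked example, baseline values of 10, 12, and 14 give a baseline mean of 12 and a standard deviation of 2, so a measurement of 16 is expressed as z = (16 - 12)/2 = 2, that is, 2 standard deviations above the baseline mean.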

[0044] “Biochemical pathway” is a term commonly used to define a series of biochemical reactions that are linked one to another, i.e., the product of one reaction is a substrate for the subsequent reaction. Biochemical pathway is not limited to linearity with respect to biochemical reactions of biological organisms. Rather, biochemical pathway is understood to include individual pathways that function as networks of interrelated biochemical reactions.

[0045] The phrase “chemical components” refers to small molecules, including endogenous metabolites, and any derivative or degradation product thereof.

[0046] As used herein, a “coherent data set” is a data set comprised of disparate data that is: integrated; expressed in a numeric format; converted to a common unit system; and optionally, dimensionally reduced. Certain types of data are generally expressed in numeric format while other types of data require conversion to numeric format. Those data in numeric format are converted to a common unit system relative to a baseline value. The term “baseline” is used herein interchangeably with “control” and “reference.” Certain data, for example, phenotypic data, are not generally expressed in numeric format. Such non-numeric data, for example, leaf necrosis and cellular dysplasia, are converted to a numeric scale relative to a baseline value. As the number of data points associated with different types of measurements can differ by orders of magnitude, the data are balanced as necessary, so that direct comparisons are meaningful. The dimensionality of the data is reduced, for example, in cases where there are many measurements obtained for a first type of data and fewer measurements for a second type of data. Dimensionality reduction is viewed as “balancing” individual data types to form a coherent data set, and may be accomplished, for example, by applying principal components analysis. The coherent data sets of the present invention serve as models for biological systems.
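The "balancing" idea in the preceding definition can be illustrated with a small sketch in which each data type, however many measurements it contains, is reduced to a single first-principal-component score per sample. The function names `first_pc_scores` and `balance` are hypothetical, and the power-iteration routine is a simple pure-Python substitute for a statistics package; it is a sketch for small data under these assumptions, not the procedure used by the inventors.

```python
import math


def first_pc_scores(rows, iterations=500):
    """Score each sample (row) on the first principal component.

    rows: list of samples, each a list of measurements in common units.
    Uses power iteration on the sample covariance matrix. The sign of the
    component is arbitrary, and an unlucky starting direction could in
    principle miss the leading component.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    v = [1.0 + j for j in range(p)]          # deliberately non-uniform start
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                      # no variance along this direction
            break
        v = [x / norm for x in w]
    # Project each centered sample onto the estimated component.
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]


def balance(data_by_type):
    """Reduce every data type to one score per sample.

    data_by_type: {type name: rows}, with the same sample order in each.
    Returns one list per sample holding one score per data type, so a type
    with thousands of measurements weighs the same as a type with a few.
    """
    scores = {name: first_pc_scores(rows) for name, rows in data_by_type.items()}
    n_samples = len(next(iter(scores.values())))
    return [[scores[name][k] for name in scores] for k in range(n_samples)]
```

With, say, five gene expression measurements and two phenotypic measurements per sample, `balance` returns two numbers per sample, one for each data type, which can then be plotted or clustered together.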

[0047] Coherent data sets comprised of cumulatively greater quantitative and qualitative indicators of biological status result in increasingly comprehensive data sets capable of reaching increasingly accurate biological predictions and conclusions. One characteristic of a coherent data set is that it is dynamic, so that previously non-incorporated data can be added as it is obtained or becomes available. The process for incorporating new data is iterative; the steps listed above are repeated with the inclusion of the new data. One purpose for creating a coherent data set is to obtain new information otherwise not available prior to data combination and analysis as a set.

[0048] “Integrated data” are data linked to, or associated with, a unique identifier of a biological sample from which the data were obtained.

[0049] For the purpose of this invention, “metabolites” refers to the native small molecules (e.g. non-polymeric compounds) involved in metabolic reactions required for the maintenance, growth, and function of a cell. Enzymes, other proteins, and most peptides are generally not small molecules and thus excluded. Many proteins participate in biochemical reactions with small molecules (e.g. isoprenylation, glycosylation, and the like). The construction and degradation of polypeptides results in either the consumption or generation of small molecules and, thus, the small molecules rather than the proteins are metabolites. Genetic material (all forms of DNA and RNA) is also excluded as a metabolite based on size and function. The construction and degradation of polynucleotides results in either the consumption or generation of small molecules and, thus, the small molecules rather than the polynucleotides are metabolites. Structural molecules (e.g. glycosaminoglycans and other polymeric units) similarly may be constructed of and/or degraded to small molecules, but do not otherwise participate in metabolic reactions. Thus, structural molecules are excluded as metabolites. Polymeric compounds such as glycogen are important participants in metabolic reactions, but are not chemically definable and are a source of metabolites (i.e. an input/output to metabolism). Thus, polymeric compounds are excluded as metabolites. Metabolites of xenobiotics are neither native, required for maintenance or growth, nor required for normal function of a cell, and thus are not metabolites. However, it is useful to monitor xenobiotics when observing the effects of a drug therapy program, or in experimentally determining the effects of a compound on an individual. Essential or nutritionally required compounds are not synthesized de novo (i.e. not native), but are required for the maintenance, growth, or normal function of a cell. Therefore, essential or nutritionally required compounds are metabolites.

[0050] “Morphology” refers to the form and structure of an organism or any of its parts. Morphology is one way of referring to a phenotype.

[0051] “Peak” refers to the readout from any type of spectral analysis or metabolite analysis instrumentation, as is standard in the art, and can represent one or more chemical components. The instrumentation can include, but is not limited to, liquid chromatography (LC), high-pressure liquid chromatography (HPLC), mass spectrometry (MS), hyphenated detection systems such as MS-MS or MS-MS-MS, gas chromatography (GC), liquid chromatography/mass spectroscopy (LC-MS), gas chromatography/mass spectroscopy (GC-MS), Fourier transform-ion cyclotron resonance-mass spectrometry (FT-MS), nuclear magnetic resonance (NMR), magnetic resonance imaging (MRI), Fourier Transform InfraRed (FT-IR), and inductively coupled plasma mass spectrometry (ICP-MS). It is further understood that mass spectrometry techniques include, but are not limited to, the use of magnetic-sector and double focusing instruments, transmission quadrupole instruments, quadrupole ion-trap instruments, time-of-flight instruments (TOF), Fourier transform ion cyclotron resonance instruments (FT-MS), and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS). It is understood that the phrase “mass spectrometry” is used interchangeably with “mass spectroscopy” in this application.

[0052] “Phenotype” refers to the observable physical, morphological, and/or biochemical/metabolic characteristics of an organism, as determined by genetic and/or environmental factors.

[0053] “Types of data,” as used herein, refers to data derived from different biological indicators. For example, types of data include, but are not limited to, data from DNA, data from RNA, data from proteins, data from metabolites, and data from phenotypic characteristics. Types of data are obtained by any process or technique known in the art; the process or technique used is immaterial to the creation of the coherent data set. However, the process or technique from which the data emanates may affect how the data are integrated. “Disparate data” are comprised of different types of data.

[0054] The present invention provides methods for organizing complex anddisparate data into logical coherent data sets. Such coherent data setsserve as models for biological systems under examination. The presentinvention provides methods for integration and analysis of largequantities of heterogeneous data. The invention is useful in numerousapplications in the agricultural, pharmaceutical, forensic,nutriceutical and biotechnology industries. Integration of data andformation of coherent data sets can be employed in a variety ofsettings, such as determining gene function; identifying drug,pesticide, and nutriceutical targets; identifying drug, nutriceutical,and pesticide compound candidates; profiling drug, nutriceutical, andpesticide compound candidates; producing a compilation of health orwellness profiles for prognostic and diagnostic use; determiningcompound site(s) of action; and identifying unknown samples, such as ina forensic setting.

[0055] Technologies abound which generate data useful in determininggene function. Gene expression analysis, phenotypic analysis, metaboliteanalysis, proteomics, 3-D protein structural analysis, and proteinexpression all provide valuable data in a quest for gene functiondetermination. Scientific tools, techniques, and technologies, incombination with nucleotide sequence data, single nucleotidepolymorphism (SNP) data, scientific literature, clinical chemistry data,and biochemical pathway data, can provide tremendous insight into theworkings of complex biological systems when the data are combined toform coherent data sets.

[0056] The invention provides a method for standardizing and combining disparate data for modeling biological systems. Methods of the present invention include entering a unique identifier of a sample into a computer tracking system, and storing in the computer tracking system data, wherein the data are linked to the unique identifier. All linked data are converted to a numeric format, and the numeric data are converted to a common unit system, wherein the common unit system data is a coherent data set and serves as a model for a biological system. Another embodiment of the invention comprises entering a unique identifier of a sample into a computer tracking system, and collecting and storing in the computer tracking system data, wherein the data are linked to the unique identifier. All linked data are converted to a numeric format, and the numeric data are converted to a common unit system. The methods of the invention are not limited in terms of the order in which the data are linked to the identifier or converted to numeric and common unit system format. For example, in one embodiment of the invention, numeric format data or common unit system data are collected; the data are linked to a unique identifier; and the data are stored in the computer tracking system.
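The steps of paragraph [0056] can be illustrated with a minimal Python sketch. It is not the implementation of the invention. It assumes, purely for illustration, that the "common unit system" is a z-score computed against reference measurements; the function names, the leaf-color coding table, and all sample values are hypothetical.

```python
from statistics import mean, stdev

# Hypothetical coding table that turns a categorical phenotypic
# observation into a number (an assumption, not from the specification).
LEAF_COLOR_CODES = {"green": 0.0, "pale": 1.0, "yellow": 2.0}


def to_numeric(value, codes=None):
    """Convert a raw observation to a float, using a lookup for categories."""
    if isinstance(value, str) and codes is not None and value in codes:
        return codes[value]
    return float(value)


def to_common_unit(value, reference):
    """Express a numeric value as a z-score against reference measurements.

    The z-score is only one possible common unit system (assumed here).
    """
    return (value - mean(reference)) / stdev(reference)


def build_coherent_record(sample_id, raw, references, codes):
    """Link data to a unique identifier, then standardize every field."""
    record = {"sample_id": sample_id}
    for name, value in raw.items():
        numeric = to_numeric(value, codes.get(name))
        record[name] = to_common_unit(numeric, references[name])
    return record


# Illustrative inputs only: one RNA measurement and one phenotypic trait.
references = {
    "gene_A_expression": [10.0, 12.0, 14.0],
    "leaf_color": [0.0, 0.0, 1.0, 1.0],
}
raw = {"gene_A_expression": 16.0, "leaf_color": "yellow"}
record = build_coherent_record(
    "SAMPLE-0001", raw, references, {"leaf_color": LEAF_COLOR_CODES}
)
print(record)
```

Because both fields end up on the same unitless scale, an expression level and a coded phenotype can be compared directly within one record, which is the property the paragraph attributes to a coherent data set.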

[0057] In one embodiment of the present invention the data are RNA data(gene expression analysis), phenotypic data, and metabolite data(biochemical profiling analysis), but one skilled in the art willunderstand that data from any technology or process may be utilized inthe methods of the invention. Further, it is understood by one skilledin the art that data from any biological organism (alive or dead) orpart thereof may be incorporated into a coherent data set. Suitablebiological organisms include, but are not limited to, plants, such asArabidopsis (Arabidopsis thaliana) and rice, fungal organisms includingMagnaporthe grisea, Saccharomyces cerevisiae, and Candida albicans, andmammals, including rodents, rabbits, canines, felines, bovines, equines,porcines, and human and non-human primates.

[0058] Suitable sample parts of biological organisms include, but arenot limited to, human and animal tissues such as heart muscle, liver,kidney, pancreas, spleen, lung, brain, intestine, stomach, skin,skeletal muscle, uterine muscle, ovary, testicle, prostate, and bone;human and animal fluids such as blood, plasma, serum, urine, mucus,semen, sweat, tears, amniotic fluid, milk; freshly harvested cells suchas hepatocytes or spleen cells; immortal cell lines such as the humanhepatocyte cell line HepG2 or the mouse fibroblast line L929; human andanimal cells grown in culture as three-dimensional culture spheres (e.g.liver spheroids); and plant tissues such as cotyledons, leaves, seeds,open flowers, pistils, senescent flowers, sepals, siliques, and stamens.

[0059] Gene expression analysis (GEA) refers to a simultaneous analysis of the expression levels of multiple genes. Traditionally, the expression of individual genes was analyzed by a technique called Northern-blot analysis. In a Northern-blot, RNA is separated on a gel, transferred to a membrane, and a specific gene is identified via hybridization to a radioactive complementary probe, usually made from DNA. A technological improvement in the area of GEA has been the development of small 1-2 cm chips used to concurrently determine expression levels of multiple genes from multiple samples. In a gene chip format, probes for the genes of interest are ordered as an array on a glass slide. After hybridization to appropriate samples, gene expression changes are often visualized with colors overlaid on an image of the chip. The color indicates the gene expression level and the location indicates the specific gene being monitored. Other technologies can be used to obtain the same type of gene information, including high-density array spotting on glass or membranes and quantitative PCR.

[0060] Phenotype refers to the observable physical or biochemical/metabolic characteristics of an organism, as determined by genetic and environmental factors. For example, in an Arabidopsis thaliana plant model system, a phenotype can be described by using distinctly defined attributes such as, but not limited to, number of: abnormal seeds, cotyledons, normal seeds, open flowers, pistils per flower, senescent flowers, sepals per flower, siliques, and stamens. Many times, perturbation of a biological system is indicated by a phenotypic trait. In humans, a perturbed biological system may result in symptoms of disease such as chest pain, signs such as elevated blood pressure, or observable physical traits such as those exhibited by individuals afflicted with Trisomy 21. A normal phenotype is useful as a reference, or baseline value, against which a physiological status can be measured.

[0061] Medical history, examination, and testing techniques are well known to medical practitioners and data derived from the same can be used in practicing the methods and systems of the present invention. For example, in cases where a practitioner is examining a patient to determine the likelihood, existence, or extent of coronary heart disease (CHD), phenotypic traits observed or identified in a clinical setting include, but are not limited to, risk factors such as blood pressure, cigarette smoking, total cholesterol (TC), low density lipoprotein cholesterol (LDL-C), high density lipoprotein cholesterol (HDL-C), and diabetes. P. G. McGovern et al., 334 NEW ENG. J. MED. 884-890 (1996). Additional phenotypic characteristics such as weight, family history of CHD, hormone replacement therapy, and left ventricular hypertrophy are also useful in determining CHD risk. It is common in the medical arts to scale or score a patient's condition based on a set of phenotypic signs and symptoms. For example, predictive models have been described based on blood pressure, cholesterol, and LDL-C categories as identified by the National Cholesterol Education Program and the Joint National Committee on Detection, Evaluation, and Treatment of High Blood Pressure. P. W. F. Wilson et al., 97 CIRCULATION 1837-1847 (1998) (incorporated herein by reference). Furthermore, predictive outcome models have also been described for patients undergoing coronary artery bypass grafting surgery and percutaneous transluminal coronary angioplasty.

[0062] Medical scoring of phenotypic traits is applicable to the assessment of patient well-being pre- and post-therapeutic intervention. For example, Short-Form 36 (SF-36) is gaining acceptance as a generic health outcome assessment form. The SF-36 validates health outcomes with 8 indices of health and well-being including general health (GH), physical function (PF), role function due to physical limitations (RP), role function due to emotional limitations (RE), social function (SF), mental health (MH), bodily pain (BP), and vitality and energy (VE). Each health object is scored on a 0 to 100 basis with higher scores representing better function or less pain. Other scoring or ranking schemas for identifying and quantifying physiologic and pathophysiologic (phenotypic) states (traits) include, but are not limited to, the following: ATP III Metabolic Syndrome Criteria; Criteria for One Year Mortality Prognosis in Alcoholic Liver Disease; APACHE II Scoring System and Mortality Estimates (Acute Physiology and Chronic Health disease Classification System II); APACHE II Scoring System by Diagnosis; Apgar Score; Arrhythmogenic Right Ventricular Dysplasia Diagnostic Criteria; Arterial Blood Gas Interpretation; Autoimmune Hepatitis Diagnostic Criteria; Cardiac Risk Index in Noncardiac Surgery (L. Goldman et al., 297 NEW ENG. J. MED. 20 (1977)); Cardiac Risk Index in Noncardiac Surgery (A. S. Detsky et al., 1 J. GEN. INT. MED. 211-219 (1986)); Child Turcotte Pugh Grading of Liver Disease Severity; Chronic Fatigue Syndrome Diagnostic Criteria; Community Acquired Pneumonia Severity Scale; DVT Probability Score System; Ehlers-Danlos Syndrome IV (Vascular Type) Diagnostic Criteria; Epworth Sleepiness Scale (ESS); Framingham Coronary Risk Prediction (P. W. F. Wilson et al., 97 CIRCULATION 1837-1847 (1998)); Gail Model for 5 Year Risk of Breast Cancer (M. H. Gail et al., 91 J. NAT'L CANCER INST. 1829-1846 (1999)); Geriatric Depression Scale; Glasgow Coma Scale; Gurd's Diagnostic Criteria for Fat Embolism Syndrome; Hepatitis Discriminant Function for Prednisolone Treatment in Severe Alcoholic Hepatitis; Irritable Bowel Syndrome Diagnostic Criteria (A. P. Manning et al., 2 BRIT. MED. J. 653-654 (1978)); Jones Criteria for Diagnosis of Rheumatic Fever; Kawasaki Disease Diagnostic Criteria; M.I. Criteria for Likelihood in Chest Pain with LBBB; Mini-Mental Status Examination; Multiple Myeloma Diagnostic Criteria; Myelodysplastic Syndrome International Prognostic Scoring System; Nonbiliary Cirrhosis Prognostic Criteria for One Year Survival; Obesity Management Guidelines (National Institutes of Health/NHLBI); Perioperative Cardiac Evaluation (NHLBI); Polycythemia Vera Diagnostic Criteria; Prostatism Symptom Score; Ranson Criteria for Acute Pancreatitis; Renal Artery Stenosis Prediction Rule; Rheumatoid Arthritis Criteria (American Rheumatism Association); Romhilt-Estes Criteria for Left Ventricular Hypertrophy; Smoking Cessation and Intervention (NHLBI); Sore Throat (Pharyngitis) Evaluation and Treatment Criteria; Suggested Management of Patients with Raised Lipid Levels (NHLBI); Systemic Lupus Erythematosus American Rheumatism Association 11 Criteria; Thyroid Disease Screening for Females More Than 50 Years Old (NHLBI); and Vector and Scalar Electrocardiography.

[0063] Still other phenotypic traits could be observed or identified by x-ray; electrocardiography; blood pressure (BP) examination; pulse; weight and height; ideal body weight or BMI; retinal examination; thyroid examination; carotid bruits; neck vein examination; congestive heart failure (CHF) signs; palpable intercostal pulses; cardiovascular examination traits including, but not limited to, S4 gallop, tachycardia, bradycardia, heart sounds, aortic insufficiency, murmur, and echocardiography; abdominal examination; genitourinary examination; peripheral vascular disease examination; neurologic examination; and skin examination. In addition to standard x-ray technologies, numerous imaging techniques are also useful in observing and identifying phenotypic traits including, but not limited to, ultrasound, magnetic resonance imaging (MRI), positron emission tomography (PET), single photon emission computed tomography (SPECT), x-ray transmission, x-ray computed tomography (X-ray CT), electrical impedance tomography (EIT), electrical source imaging (ESI), magnetic source imaging (MSI), and laser optical imaging.

[0064] Global assays (or global analyses) are performed as a means of making gross comparisons in materials for substances including, but not limited to, total protein, carbohydrate, and fat content.

[0065] Metabolite analysis refers to an analysis of organic, inorganic, and/or biomolecules (hereinafter collectively referred to as “small molecules”) of a cell, cell organelle, tissue and/or organism. It is understood that a small molecule is also referred to as a metabolite. Techniques and methods of the present invention employed to separate and identify small molecules, or metabolites, include but are not limited to: liquid chromatography (LC), high-pressure liquid chromatography (HPLC), mass spectroscopy (MS), gas chromatography (GC), liquid chromatography/mass spectroscopy (LC-MS), gas chromatography/mass spectroscopy (GC-MS), nuclear magnetic resonance (NMR), magnetic resonance imaging (MRI), Fourier Transform InfraRed (FT-IR), and inductively coupled plasma mass spectrometry (ICP-MS). It is further understood that mass spectrometry techniques include, but are not limited to, the use of magnetic-sector and double focusing instruments, transmission quadrupole instruments, quadrupole ion-trap instruments, time-of-flight instruments (TOF), Fourier transform ion cyclotron resonance instruments (FT-MS), and matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS).

[0066] Metabolite analysis allows the relative amounts of metabolites to be determined in an effort to deduce a biochemical picture of physiology and/or pathophysiology. In one embodiment of the present invention, individual metabolites present in cells are identified and a relative response measured, establishing the presence, relative quantities, patterns, and/or modifications of the metabolites. In a related embodiment of the invention, the metabolites are linked to enzymatic reactions and metabolic pathways. In another embodiment, rather than identifying metabolites, the spectral properties of chemical components in a biological sample are characterized and the presence or absence of the chemical components noted. In a further embodiment of the invention, a metabolic profile is obtained by analyzing a biological sample for its metabolite composition under particular environmental conditions.

[0067] In one embodiment of the invention, a method is provided for examining metabolites in a biological sample, comprising entering a unique identifier of at least one biological sample into a computer tracking system; simultaneously collecting data from the sample, for a plurality of peaks, each peak comprising at least one chemical component; storing in the computer tracking system the chemical component data, wherein the data are linked to the unique identifier; characterizing and/or identifying the chemical components; and linking the characterized and/or identified chemical components to metabolites in biochemical pathways.

[0068] In the methods of the invention, data is collected for a plurality of peaks, each peak comprising at least one chemical component. In the methods of the invention the plurality of peaks comprises at least 25, 30, 40, 50, 60, 75, 85, 100, 125, 150, 175, 200, 225, 250, 275, 300, 350, 400, 450, 500, 550, 600, 650, 700, 750, 800, 850, 900, 950, 1000, 1050, 1100, 1150, 1200, 1250, 1300, 1350, or 1400 or more peaks.

[0069] In another method of the invention, a method is provided for examining metabolites in a biological sample. The method comprises entering a unique identifier of at least one biological sample into a computer tracking system; simultaneously collecting data for a plurality of peaks, each peak comprising at least one chemical component, from the sample, wherein the data comprise data from at least two processes; storing in the computer tracking system the data, wherein the data are linked to the unique identifier; adding the linked data to a database wherein the database comprises linkages between chemical components, biochemical pathways, and phenotype; identifying the chemical components; and querying the database for correlations between the chemical components, the biochemical pathways, and the phenotype.
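One way to picture the database of paragraph [0069] is the following Python sketch, which uses an in-memory SQLite database. The table layout, column names, and all rows are hypothetical and serve only as an illustration. The query counts how many samples with a given phenotype contain a component from each pathway, which is a simple co-occurrence measure standing in for the "correlations" of the method, not a statistical test.

```python
import sqlite3


def build_database():
    """Create hypothetical tables linking peaks, pathways, and phenotypes."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE peaks (sample_id TEXT, component TEXT, response REAL);
        CREATE TABLE pathways (component TEXT, pathway TEXT);
        CREATE TABLE phenotypes (sample_id TEXT, trait TEXT);
        """
    )
    return db


def pathways_for_trait(db, trait):
    """Count samples showing `trait` that contain a component of each pathway."""
    return db.execute(
        """
        SELECT pw.pathway, COUNT(DISTINCT pk.sample_id) AS n_samples
        FROM phenotypes ph
        JOIN peaks pk ON pk.sample_id = ph.sample_id
        JOIN pathways pw ON pw.component = pk.component
        WHERE ph.trait = ?
        GROUP BY pw.pathway
        ORDER BY n_samples DESC, pw.pathway
        """,
        (trait,),
    ).fetchall()


db = build_database()
# Illustrative rows only; every value is linked to a unique sample identifier.
db.executemany(
    "INSERT INTO peaks VALUES (?, ?, ?)",
    [("S1", "citrate", 0.8), ("S2", "citrate", 1.1),
     ("S2", "glucose", 0.4), ("S3", "glucose", 0.9)],
)
db.executemany(
    "INSERT INTO pathways VALUES (?, ?)",
    [("citrate", "TCA cycle"), ("glucose", "glycolysis")],
)
db.executemany(
    "INSERT INTO phenotypes VALUES (?, ?)",
    [("S1", "pale leaves"), ("S2", "pale leaves"), ("S3", "normal")],
)
result = pathways_for_trait(db, "pale leaves")
print(result)
```

Keying every table on the sample identifier or the chemical component is what lets a single join connect a phenotype to the pathways behind it.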

[0070] In an alternate embodiment of the invention, GEA profiling,phenotypic analysis, and metabolite analysis are combined into one dataset. Inclusion of GEA data allows the level of transcription of numerousgenes to be monitored, while the inclusion of phenotypic analysis allowsobservable traits to be correlated with their molecular and cellularcauses. Inclusion of metabolite analysis data allows correlation ofsmall molecule profile data with the gene expression patterns andphenotypic characteristic data. Inclusion of biologically disparate datain a coherent data set allows creation of a model that accuratelyrepresents a biological system.

[0071] The methods and systems of the present invention include, as another type of technology data source, SNP-derived data. SNPs, or single nucleotide polymorphisms, are alterations in DNA sequences that involve only a single DNA base pair and may be shared by multiple individuals. Many SNPs do not produce observable physical changes in individuals with affected DNA. However, even SNPs that do not themselves alter protein expression or play a role in pathogenesis may be proximal to deleterious mutations on a chromosome. It is thought that 85 percent of exons in the human genome are within 5 kb (kilobases) of the nearest SNP. Because of this proximity, SNPs may be shared among groups of people with harmful, but unknown, mutations and the SNP may serve as a marker for the mutation. Such markers help reveal the mutations and accelerate efforts to find novel targets for diagnostic and therapeutic intervention, and may help in personalizing drug regimens by allowing a signature profile representative of a patient's tolerance to be interpreted prior to beginning a treatment. R. Sachidanandam, et al., 409 NATURE 928 (2001). Inclusion of SNP data in the formation of coherent data sets, along with other data types, has the potential to significantly improve identification of new signature profiles for disease staging and personalizing drug regimens. SNPs may also play a significant role in the investigation of haplotypes, a combination of many neighboring SNPs on a single chromosome. Haplotyping may yield more information about the genotype-phenotype relationship than individual SNPs.

[0072] Still another type of technology useful in the methods andsystems of the present invention is proteomics. Proteins play animportant role as structural and functional components of cells and bodyfluids of living organisms. Proteomics involves the identification ofproteins in cells or tissues and their role in physiological function,enabling identification, as well as quantification, of tens of thousandsof proteins present in biological samples. Since the total number ofproteins expressed in an organism is encoded in its genome, one aim ofproteomics is to correlate gene sequences to proteins, and hence toelucidate the function of various genes. The production or suppressionof proteins in tissues or cells in response to external stimuli providesan important insight into gene regulation. Proteomic studies can bedesigned to shed light on the mechanism(s) by which a drug or pesticideacts, as well as provide information regarding various side effects thatmay be associated with its administration. Relative comparison ofprotein profiles from normal and diseased tissue may represent proteinsthat are potential targets for pharmaceutical or agricultural discovery.An understanding of mechanisms occurring at the molecular level isimportant to designing effective drug therapies, or in determining thefunction of genes with agricultural importance. In one embodiment of thepresent invention, proteomics-derived data are contained in a coherentdata set to provide an improved understanding of the relationshipbetween genes, proteins, and function.

[0073] In one embodiment, the methods and systems of the current invention provide ways of combining biologically disparate data for the creation of coherent data sets that serve as models of biological systems. Biologically disparate data are data derived from different indicators of the biological status of an organism or individual. These indicators include DNA, RNA, proteins, metabolites, and phenotypes, as shown in FIG. 1. The resolution power of coherent data sets promises to be enormous, as not only can different types of data from a single organism be combined and directly compared for improved representation of an entire biological system or organism, but data from completely different organisms can be analyzed together in a coherent data set for similarities and differences. This may prove to be very valuable in the pharmaceutical arena, for instance, where the effect of a drug compound on both the pathogen and the host can be analyzed and compared (see Specific Examples 5 and 7, infra).

[0074] In the methods and systems of the present invention, data areacquired in a manner that facilitates the formation of coherent datasets as models of biological systems that are applicable to manydifferent areas of the life sciences industry. Identification of noveltargets for drug, pesticide, and nutriceutical applications is ofprimary importance. In the pharmaceutical arena alone, it is estimatedthat existing drugs interact with fewer than 500 biological targets outof an estimated 10,000 potential ones. Based on this estimation, asignificant majority of potential drug targets remain undiscovered. Inthe field of agricultural crop protection, only 20 distinct sites ofaction for herbicidal compounds have been discovered and reported in thepast 60 years, even though estimates of potential herbicide targetsexceed this number by two orders of magnitude.

[0075] A key component of applying genomics tools to target discovery isthe collection of functional information on how genes and gene productsimpact cells, tissues, organs and their associated healthy and diseasedstates. While biologically disparate data are being collected andanalyzed categorically to address target discovery, the presentinvention provides a method for combining the disparate data intobiologically meaningful groupings to create a data set that describes acondition in greater detail than that achievable through a collectiveanalysis of its individual components.

[0076] After new targets for drug, pesticide, and nutriceuticalapplications are identified, there remains a long and difficult processfor the development of an effective product aimed at the identifiedtarget, as shown in FIG. 2. Using the pharmaceutical field as anexample, an average of 10,000 lead compounds must be tested inpre-clinical development for every one drug that is ultimately marketed.The methods of the present invention maximize efficiency in bringingtargets to product development. In one embodiment of the invention,coherent data sets are created from disparate data. By using dataderived from multiple biological indicators of physiological status,compelling targets can be more thoroughly validated and optimized forgreatest effectiveness.

[0077] Another area of primary importance in the life sciences industryis the identification of novel lead compounds for use in drug,pesticide, and nutriceutical applications. The methods and systems ofthe present invention allow biological samples to be screened usingmultiple technologies, providing for the simultaneous examination ofdisparate indicators of biological status, so that the effect of aparticular chemical compound on a sample can be understood morethoroughly than was historically possible. Creation of coherent datasets allows subtle and complex effects to be observed so that target andlead compound identification, validation and selection are moreefficient. The optimization of lead compounds is more efficient as well,as it is possible to optimize the application of the selected leads, andscreen-out selected leads based on parameters such as toxicity. Themethods and systems of the present invention allow for the developmentof highly efficacious products while spending as little time and moneyas possible at a discovery stage.

[0078] Discovering and developing new pharmaceutical drugs has becomeincreasingly expensive and challenging. According to the Tufts Centerfor the Study of Drug Development, the cost of developing a single newdrug and bringing it to market (including failures) now exceeds $800million in the United States. The length of time from the discovery of acandidate to its approval by the FDA has increased from eight years inthe 1960s to more than 14 years at the time of this filing. Adversetoxic side effects from drugs result in more than two millionhospitalizations each year and more than 100,000 deaths. The methods ofthe present invention lower the cost of drug discovery, decrease thetime to market for new drugs, lower the incidence of adverse toxic sideeffects, and complement other genomics tools to help researchers betterunderstand the link between cellular or biochemical function,pharmaceutical compounds, toxicity, and disease response. The presentinvention is also applicable to the discovery and development of newpesticides and nutriceutical products, by lowering the cost ofdiscovery, decreasing the time to market, and lowering the incidence ofadverse side effects.

[0079] In one embodiment of the present invention, promisingpharmaceutical or pesticidal compounds that have failed to reachcommercial production due to toxic effects are studied using coherentdata sets to determine precisely the origin of the toxicity. Armed withinformation from a coherent data set, it is possible to rescue a faileddrug or herbicide compound, or to use coherent data set-derivedinformation to select a similar candidate more likely to succeed as amarketable product. The large sums of money invested in the developmentof failed compounds are not lost and can still result in an effectiveand marketable product.

[0080] The methods and systems of the present invention are useful forcompiling health or wellness profiles for organisms and for providingprofiles representative of particular diseases or other specificphysiological states. Profiles generated by methods of the presentinvention are composed of data from a single indicator of physiologicalstatus, or from any combination of such indicators. Data obtained froman individual are compared to a baseline, or reference population, todetermine physiologic status. It is understood that a baseline, acontrol, a reference, and a standard are used as equivalent terms inreferring to the present invention. Baseline populations, for example,consist of data from individuals of a particular group, such as healthyor normal individuals, or individuals diagnosed as having a particulardisease state or other physiological state of interest. This featureallows scientists to choose the types of data most informative for aparticular condition and representative of an individual's state ofwellness, referred to herein as a signature profile.

[0081] In one embodiment of the invention, a method is provided forestablishing a signature profile indicative of the physiological statusof an individual. The method comprises entering a unique identifier ofat least one biological sample into a computer tracking system; storingin the computer tracking system data from the sample, wherein the dataare linked to the unique identifier. The linked data are compared to areference and the most informative of the compared data are determined,wherein the most informative data are a signature profile indicative ofphysiological status.

[0082] In another embodiment of the invention, a method is provided forestablishing a signature profile indicative of the physiological statusof an individual. The method comprises entering a unique identifier ofat least one biological sample into a computer tracking system; storingin the computer tracking system metabolite data from the sample, whereinthe data are linked to the unique identifier. The linked data arecompared to a reference and the most informative of the compared dataare determined, wherein the most informative data are a signatureprofile indicative of physiological status.

[0083] In an alternative embodiment of the invention, signature profilesindicative of physiological status are established by integration ofdisparate data and formation of coherent data sets according to themethods and systems of the present invention. The method comprisesentering a unique identifier of at least one biological sample into acomputer tracking system; storing in the computer tracking systemdisparate data linked to the unique identifier; converting the linkeddisparate data to a numeric format; and converting the numeric formatdata to a common unit system. The method further comprises determiningthe most informative of the common unit system data, wherein the mostinformative data are a signature profile indicative of physiologicalstatus. The disparate data of the invention include, but are not limitedto, RNA data (for example, gene expression data), phenotypic data(visible or diagnostic trait), metabolite data, protein data (such as a2D gel), or DNA data (such as SNP information).

[0084] Another embodiment of the invention provides a method forestablishing a signature profile indicative of the physiological statusof an individual comprising entering a unique identifier of at least onebiological sample into a computer tracking system; storing datacomprising metabolite data in the computer tracking system, wherein thedata are linked to the unique identifier; converting the linked data toa numeric format; and converting the numeric format data to a commonunit system. The method further comprises determining the mostinformative of the common unit system data, wherein the most informativedata are a signature profile indicative of physiological status. In arelated embodiment of the invention, the data comprise metabolite dataand at least one other type of data. In another related embodiment ofthe invention, the data comprise metabolite data and at least two othertypes of data.

[0085] In further embodiments of the invention, a signature profile consists of one type of data, such as RNA data (for example, gene expression data), phenotypic data (visible or diagnostic trait), metabolite data, protein data (such as a 2D gel), or DNA data (such as SNP information). In another embodiment of the invention, a signature profile consists of two types of data, such as RNA data and phenotypic data, or RNA data and metabolite data, or any paired combination of the above. In another embodiment of the invention, a signature profile consists of three types of data, such as RNA data, metabolite data, and phenotypic data, or any three-way combination of the above. In another embodiment, a signature profile consists of four types of data, such as RNA data, metabolite data, DNA data and phenotypic data, or any four-way combination of the above. In another embodiment, a signature profile consists of five types of data, such as RNA data, metabolite data, DNA data, protein data and phenotypic data, or any five-way combination of the above. In yet another embodiment, a signature profile consists of a plurality of types of data.

[0086] The most informative data is the data most informative for the physiological state of interest. The most informative data is, for example, but not limited to, data exhibiting the most statistically significant change as compared to a baseline, or is data exhibiting the most unusual or unique characteristics, or the characteristics which are most predictive of, or most often correlate with, the physiological state of interest. The most informative data may, for example, be a group of relatively small changes in physiological state, rather than one large change. A powerful feature of the signature profiles of the invention is that there is no upper limit on the number or types of data that can be incorporated into the profiles, thus allowing vastly more complex, and more representative, signature profiles to be generated than has been previously possible. Another feature of the signature profiles of the invention is that, because the methods of the invention may be applied iteratively, a signature profile for a particular use, such as diagnosis of a disease state, or identification of exposure to a toxin, can continue to be refined and improved as more data is collected. The addition of more data does not necessarily lead to an enormously complex signature profile, with many data measurements. Rather, in one embodiment, it leads to reduction of the data and identification of the most valid indicators of a particular perturbation.
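
One of the criteria named above, the largest change relative to a baseline, can be sketched as a simple ranking. This is a hedged illustration and not the selection method of the invention: it assumes the data have already been expressed in a common unit system relative to a baseline, and the function name `most_informative`, the `top_n` cut-off, and the feature names are hypothetical. Other criteria mentioned above, such as groups of small correlated changes, are not covered by this sketch.

```python
def most_informative(common_unit_data, top_n=3):
    """Keep the top_n features with the largest absolute deviation.

    common_unit_data maps a feature name to its value in common units,
    where 0.0 means "no change from baseline" (an assumption of this sketch).
    """
    ranked = sorted(
        common_unit_data.items(),
        key=lambda item: abs(item[1]),
        reverse=True,
    )
    return dict(ranked[:top_n])


profile = most_informative(
    {"glucose": 0.1, "lactate": -3.0, "gene_X": 2.0, "leaf_area": 0.5},
    top_n=2,
)
print(profile)  # {'lactate': -3.0, 'gene_X': 2.0}
```

Re-running the selection as more samples are collected mirrors the iterative refinement described above: the retained set can shrink to the most valid indicators rather than grow with the data.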

[0087] Various embodiments of the invention provide methods and systems for the development of, for example, signature profiles for diagnosing physiological states, including disease stages, and for providing a prognosis of a disease state, thereby determining which therapeutic program to apply. A physiological state of an individual is then monitored to determine whether the therapeutic program as applied is providing a return to a desired state. If not, or if undesirable side effects are observed, the therapeutic program is adjusted to improve its efficacy. The individual is monitored throughout the treatment/disease process, so that the therapeutic program is a dynamic, iterative process that is constantly adjusted or fine-tuned to suit the individual's needs. Further embodiments of the invention provide methods and systems for the development of signature profiles useful as indicators of exposure to particular chemical or environmental toxins.

[0088] A database of endogenous metabolites for analysis of biological samples is useful in determining an individual's physiological state. The present invention provides methods and systems for creating a database of endogenous metabolites that provides information pertinent to biochemical pathway designation and disease or phenotype association for compounds of interest, and provides data useful in a coherent data set. As illustrated in FIG. 3, a nominated compound is examined by one or more metabolite analysis method(s), also called spectral analysis methods, such as liquid chromatography (LC), high-pressure liquid chromatography (HPLC), mass spectroscopy (MS), hyphenated detection methods such as MS-MS or MS-MS-MS, gas chromatography (GC), liquid chromatography/mass spectroscopy (LC-MS), gas chromatography/mass spectroscopy (GC-MS), Fourier transform-ion cyclotron resonance-mass spectrometer (FT-MS), nuclear magnetic resonance (NMR), magnetic resonance imaging (MRI), Fourier Transform InfraRed (FT-IR), inductively coupled plasma mass spectrometry (ICP-MS), and the like. Resulting data are processed, characteristics of the compound are noted (for example, retention time, intensity, and mass), and information is stored in the database. In addition to spectral characteristics, the database of endogenous metabolites can contain any information or data pertaining to the compound. This information can include, but is not limited to and need not include, compound nomenclature and synonyms, chemical structure, molecular formula, molecular weight, Enzyme Commission number (EC #), Chemical Abstracts Service number (CAS #), vendor information, biological sample types in which the compound is found, enzymatic reactions and/or biochemical pathways in which the compound is involved, and disease states or phenotypic characteristics with which the compound is associated. It is important to note that only one piece of information is required for a compound to be eligible for entry into the database of endogenous metabolites, so that, for example, as soon as a spectral peak is consistently observed or a compound is identified, it is added to the database. The database of endogenous metabolites is updated, and information continually added as it becomes available, so that linkage of compounds to gene function, biochemical pathways, and physiological states becomes more complete over time. It is understood by a person skilled in the art that any information from the database of endogenous metabolites which is to be included directly in a coherent data set must first be converted to a numeric format.
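
The rule that a single piece of information suffices for entry, with further information added over time, can be sketched as a record in which every field is optional. This is an illustrative data structure only; the class name `MetaboliteEntry` and the particular selection of fields are assumptions for the example and do not reflect the actual schema of the database of endogenous metabolites.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional


@dataclass
class MetaboliteEntry:
    """One compound record; every field is optional (hypothetical schema)."""

    name: Optional[str] = None
    cas_number: Optional[str] = None
    molecular_formula: Optional[str] = None
    retention_time_min: Optional[float] = None
    mass: Optional[float] = None
    pathways: List[str] = field(default_factory=list)
    diseases: List[str] = field(default_factory=list)

    def is_eligible(self):
        """True as soon as at least one field carries information."""
        for f in fields(self):
            value = getattr(self, f.name)
            if value is not None and value != [] and value != "":
                return True
        return False


# A consistently observed but still unidentified spectral peak is enough.
peak = MetaboliteEntry(retention_time_min=4.2, mass=180.063)
print(peak.is_eligible())  # True

# Information is added later as it becomes available.
peak.name = "glucose"
peak.pathways.append("glycolysis")
```

Keeping unknown fields empty rather than requiring them lets linkage to pathways and disease states grow more complete without blocking early entry of a compound.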

[0089] A database of endogenous metabolites is useful in linking data contained in coherent data sets to enzymatic reactions and biochemical pathways, and ultimately linking to associated diseases and/or phenotypes. It is generally accepted that metabolic responses of living organisms are altered by genetic makeup (or change), disease state, chemical (including therapeutic) treatment/insult, or environmental insult. An insult, as used herein, refers to an injury to an organism or one of its parts, or something that causes or has a potential for causing such injury. Air pollution, for example, is accepted to be one type of environmental insult. Other types of chemical and environmental insults to humans and animals include, but are not limited to, exposure to pesticides, exposure to industrial wastes, diet and changes therein, and weather changes. It is understood that although some types of chemical treatment are intended to, and do, have positive effects in the treatment of disease, the same chemical treatment may have detrimental effects as well. Other types of chemical and environmental insults to plants include, but are not limited to, exposure to pesticides, exposure to industrial wastes, exposure to temperature changes, exposure to low light conditions, exposure to changes in the amounts of nitrogen and phosphorus available in the soil, exposure to drought, exposure to salinity changes in the soil, and exposure to too much moisture. Thus, the methods and systems of the invention are useful for understanding the relationship between biochemical response and disease and/or phenotype association. As illustrated in FIG. 3, once any of the three information fields of enzymatic reaction, biochemical pathway, or disease or phenotype association is known, it is possible to link to the other information fields, thus maximizing the efficiency with which new correlations are made with research data. The database of endogenous metabolites is a dynamic information source, meaning that more information is entered into it as data becomes available, making pathway correlations and linkages more complete.

[0090] While not typically associated with gene function, forensic sciences are important as a research field, especially in the area of suspect identification through analysis of biological evidence collected from a crime scene. The methods and systems of the present invention are useful in generating a wealth of information from a small sample size, which is typical of crime scene evidence, and allow meaningful analysis of the information through the formation of coherent data sets, leading to more accurate interpretation of the data. This is useful not only in linking suspects to crime scenes, but also, for example, in the identification of unknown deceased individuals, determination of toxicology involved in death, and determination of the specifics of drug or alcohol abuse when it is an element of a crime. Forensic pathological and toxicological results are complex and often difficult to interpret. The present invention improves the acquisition of useful data from crime scene evidence and the subsequent analysis of the data, making interpretation of results and presentation in legal proceedings more efficient.

[0091] The present invention introduces coherent data sets as a way to manage biologically relevant data by making them analytically comparable, including disparate data from different indicators of the biological status of an individual or organism. Prerequisites for creating a coherent data set are integrated data and a baseline value for each type of data used to measure various biological indicators. In biological experimentation, measured values reflect the sum of several types of variation. A baseline, or reference, is needed so that biological variation can be distinguished from variation due to experimental error. In the methods and systems of the invention, data are converted to a common unit system relative to a control (the baseline). A control, or reference, can be as typically thought of in a scientific experiment, wherein a rigorously controlled standard is included in an experiment. It can also be simply a measure of a sample or group of samples of interest, such as a group of samples from humans who are defined as healthy or having a particular disease state. The nature of the reference depends on the type of information sought and what is most pertinent to that. It is accepted that a person skilled in the art can determine an appropriate baseline or reference.
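
As a hedged illustration of converting data to a common unit system relative to a control, the sketch below expresses each measurement as the number of control standard deviations by which it differs from the control mean. The choice of this particular unit (a z-score against the control group) is an assumption made for the example; this passage does not prescribe a specific formula, and the function name `to_common_units` is hypothetical.

```python
from statistics import mean, stdev


def to_common_units(values, control_values):
    """Express values as control standard deviations from the control mean.

    Assumed unit system for illustration: (value - control mean) / control SD.
    At least two control values with some spread are required.
    """
    baseline = mean(control_values)
    spread = stdev(control_values)
    if spread == 0:
        raise ValueError("control values show no variation")
    return [(value - baseline) / spread for value in values]


# Metabolite level and leaf length are measured in unrelated raw units,
# but both become comparable once scaled against their own controls.
metabolite = to_common_units([2.0, 4.0], control_values=[1.0, 2.0, 3.0])
leaf_length = to_common_units([30.0], control_values=[10.0, 20.0, 30.0])
print(metabolite)   # [0.0, 2.0]
print(leaf_length)  # [1.0]
```

Scaling by the spread of the control is one way to separate biological variation from experimental variation, since a deviation is judged against how much the reference itself varies.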

[0092] Coherent data sets can be vastly more informative and biologically meaningful than data collected and analyzed from individual data streams. The present invention provides tools to integrate data and to create coherent data sets that encompass data from multiple indicators of biological status. The invention also comprises tools for analysis of coherent data sets to facilitate the identification of product leads; determination of gene function; identification of product candidates; production of a compilation of health or wellness profiles for prognostic and diagnostic use; determination of compound site(s) of action; and identification of unknown samples, such as in a forensic setting.

[0093] The methods and systems of the present invention are applicable to any organism or cell culture system and are flexible enough to accommodate data from any combination of biological indicators. Tools of the present invention are provided in such a way that data from additional technologies or sources can be added as each is developed and adopted in a scientific community, or excluded as desired. It is understood that disparate data are derived from different indicators of a biological status of an individual or organism. For example, different physiological indicators include DNA, RNA, proteins, metabolites, and phenotypes, and are measured using a variety of different technological approaches such as, but not limited to, DNA sequencing, gene expression analysis, 2D gels, mass spectrometry, NMR, and direct measurement of various phenotypic traits. Newly developed technologies are likely to improve identification of gene function and product leads in a high throughput environment and data from emerging technologies can be readily incorporated into coherent data sets. The methods of the invention are suitable for a broad range of applications in industry, government, and academia. With the present invention, the standard for the generation of coherent data sets produces a system for high throughput, automated data analysis to identify gene function and leads for product development. The invention further provides methods for creating, managing, processing, and using coherent data sets specifically for the purpose of predicting gene function and compound site of action, the results of which can lead directly to product development.

[0094] Current capabilities to generate integrated data are not sufficient and are oftentimes highly inefficient, resulting in a loss of data. FIG. 4 illustrates how the concept of coherent data sets shifts the focus from relatively simple gene identification schemes in integrated data to a “rich annotation” that includes analysis from coherent data sets in addition to traditional annotation. It is helpful to employ biological resources to validate functional predictions. As validated predictions are added to the annotation database, the database becomes increasingly more valuable.

[0095] The present invention provides methods and systems that can greatly improve the reliability and efficiency of gene function determination and lead discovery, including enabling technologies such as generic methods and tools to integrate data and to generate coherent data sets. Modular tools can be utilized to efficiently analyze coherent data sets, but are not necessarily required to generate coherent data sets. The present invention also provides methods and tools that enable the efficient integration of data, and the creation and testing of coherent data sets to predict gene function independently of organism or cell culture system. The development of the methods of the present invention is an interdisciplinary project at the interface of biology, bioinformatics, and software engineering.

[0096] In one embodiment, the present invention uses real-time data streams from biological experiments from multiple research technologies. The development of analytical tools for biological research often occurs without sufficient input from biologists. Coherent data sets depend upon biologists to validate predictions made with the tools described herein. This biology-dependent approach to the development of analytical tools helps to strengthen and build the concept of coherence and prediction of gene function.

[0097] Integrated data are a prerequisite to the development of coherent data sets. With data streams from a variety of technologies increasing at an unprecedented rate, the problem of data overload is addressed by a richer annotation database that includes a wide range of information, including experimental results and inferential conclusions. The annotation database is the “data to knowledge” link, a key to gene function discovery. Data generating technologies currently in use include, but are not limited to, sequencing and annotation, metabolite analysis, gene expression analysis, and phenotypic analysis (morphometrics). Suitable biological systems include, but are not limited to, plants, such as Arabidopsis (Arabidopsis thaliana) and rice, fungal organisms including Magnaporthe grisea, Saccharomyces cerevisiae, and Candida albicans, and mammals, including rodents, rabbits, canines, felines, bovines, equines, porcines, and human and non-human primates. However, it should be remembered that the methods and systems of the present invention are applicable to any biological system. Informatics technologies can include bioinformatics, laboratory information management systems (LIMS), software engineering, and information technologies.

[0098] The organization of FUNCTIONFINDER technology is shown in FIG. 5. FUNCTIONFINDER technology (Paradigm Genetics, Inc., Research Triangle Park, N.C.) comprises four interrelated components: databases, data processing, data analysis tools, and user interfaces. Data are extracted from a Refinery layer (REFN) and integrated in the Abstraction (ABST) layer. Public databases and other sources of relevant data are integrated in the Abstraction layer with proprietary data generated “in-house.” Integrated data are used to generate coherent data that are stored in a relational database and subsequently extracted into coherent data sets for efficient access by Discovery layer (DISC) tools.

[0099] Data are generated from a plurality of instruments and stored in a variety of media, such as proprietary databases, LIMS, flat files, Excel spreadsheets, and other electronic storage methods well known in the art, and then loaded into an integrated database. For example, a refinery database can contain data related to soil samples, such as experimental plants grown in a flat (container) of soil. Soil sample data are stored in LIMS, and a computer program copies information from LIMS into the refinery. Gene mutation data related to the experimental plants is stored in a separate proprietary database. To populate the refinery, a computer program copies information from the proprietary database to the refinery database. To ensure accurate and efficient integration, integrity checking and enforcement steps occur as the data are loaded to the refinery. Integrity checking and enforcement further ensures that the data in the database are fully integrated, properly identified, and linked to all associated data. Data in the refinery belong to, or are uniquely associated with, a measurement set, a collection of measurements related to an experiment. One aspect of enforcing integrity is to ensure that each data point belongs to, or is associated with, a measurement set. The integrated database stores data in a tree-like structure, so that a measurement can be linked to other measurements further up the tree, and measurements further down the tree can be linked to the integrated database. Integrity checking further ensures that all upward links are present and valid when a data point is stored.
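
The two integrity rules described above, membership in a measurement set and valid upward links at the time a data point is stored, can be sketched with a small in-memory structure. This is an illustrative sketch under stated assumptions, not the refinery database itself: the names `Refinery`, `IntegrityError`, `store`, and `lineage`, and the example identifiers, are hypothetical.

```python
class IntegrityError(Exception):
    """Raised when a data point would break a required link."""


class Refinery:
    """Toy integrated database (illustrative, not the actual system)."""

    def __init__(self):
        self.measurement_sets = set()
        # measurement id -> (measurement set id, parent id or None, value)
        self.measurements = {}

    def add_measurement_set(self, set_id):
        self.measurement_sets.add(set_id)

    def store(self, meas_id, set_id, value, parent_id=None):
        # Each measurement is stored once, which also rules out cycles.
        if meas_id in self.measurements:
            raise IntegrityError(f"duplicate measurement: {meas_id}")
        # Every data point must belong to a known measurement set.
        if set_id not in self.measurement_sets:
            raise IntegrityError(f"unknown measurement set: {set_id}")
        # Any upward link must point at a measurement already stored.
        if parent_id is not None and parent_id not in self.measurements:
            raise IntegrityError(f"missing upward link: {parent_id}")
        self.measurements[meas_id] = (set_id, parent_id, value)

    def lineage(self, meas_id):
        """Follow upward links from a measurement to the top of the tree."""
        chain = []
        current = meas_id
        while current is not None:
            chain.append(current)
            current = self.measurements[current][1]
        return chain


refinery = Refinery()
refinery.add_measurement_set("experiment-1")
refinery.store("flat-7", "experiment-1", {"soil_ph": 6.4})
refinery.store("plant-3", "experiment-1", {"height_cm": 12.5}, parent_id="flat-7")
print(refinery.lineage("plant-3"))  # ['plant-3', 'flat-7']
```

Rejecting a data point at load time, rather than repairing links afterwards, is what keeps every stored measurement traceable to its experiment and to the measurements above it in the tree.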

[0100] In one embodiment, the efficiency of data integration is improved using, for example, DiscoveryCenter software (Lion bioscience, Inc., Cambridge, Mass.), including components for data integration at the refinery and abstraction layers, as well as components for presentation and analysis at the discovery layer. DiscoveryCenter includes DataMarts (mini data warehouses) for sequence, expression, and genotyping data and IBM's DiscoveryLink (IBM Corp., Armonk, N.Y.) technology for federated data management. DiscoveryCenter uses DataMarts and DiscoveryLink technologies to concertedly address a wide range of data integration needs in life sciences research. FUNCTIONFINDER and DiscoveryCenter contribute components to support a comprehensive, integrated environment for gene functional analysis. One embodiment of the invention involves having a first research group or company generating complex integrated data sets emanating from several technologies, including sequence and annotation, metabolite analysis, gene expression analysis, and phenotypic analysis, with a second research group developing data integration technologies spanning biological and chemical information to generate flexible, integrated systems for gene function analysis.

[0101] An alternate embodiment of the invention supports, for example, two parallel approaches for identification of leads for pharmaceutical or pesticide product development: 1) testing compound site of action, and 2) conducting genomic research (functional gene knock-outs). In a gene knock-out experiment, the goal is to identify the function of a gene that has been disrupted. In a site of action (SOA) experiment, a goal is to predict a site or process in a cell that is affected by treatment with a specific compound. In either case, the approach is to perturb a biological system and then characterize the effect(s) of that perturbation as completely and comprehensively as possible. The present invention provides coherent data sets derived from multiple technologies/sources to further provide different views of the depth and complexity which characterize the status of a normal versus perturbed biological system. Although the gene knock-out approach leads directly to the identification of gene function, SOA experiments also contribute to an understanding of a biological system by providing information that can lead, indirectly, to identification of gene function. Accordingly, coherent data sets derived from SOA and genomic technologies may provide synergisms to gene function and site of action research.

[0102] The present invention provides methods and systems for the integration of data from disparate sources. Broad initiatives like the Human Genome Project generate data in quantities previously unavailable to the scientific community. Technology continues to advance the study of biological and other systems to an extent that the technical capacity to generate, capture, and store data is outpacing the ability to analyze data to a results-oriented endpoint. In recent years a number of new technologies have become available for generating data in life sciences research. Advances in technology include, but are not limited to, high-throughput sequencing for expressed and genomic DNA, the identification and sequencing of SNPs (single nucleotide polymorphisms), microarray experiments for measuring gene expression, various technologies for measuring protein-protein interactions and protein expression, combinatorial chemistry, and high-throughput screening. The aforementioned advances in technology, combined with more traditional technologies such as phenotypic measurements and metabolite analysis, provide a broad range of technologies for generating data. While advances in technology continue to provide the scientist with ever increasing data generation capacity, technology developers rarely consider the challenges of integrating certain technology types with existing technology types to facilitate integrated analysis of the information available from the combined data streams. The present invention provides methods and systems for producing integrated systems as the first step in creating and analyzing coherent data sets.

[0103] In order to support the creation and analysis of coherent data sets, proper technical infrastructure must be available. Appropriate computer hardware is supplied, for example, by the Sun Microsystems' E420 workgroup server (Sun Microsystems, Inc., Santa Clara, Calif.). Appropriate operating systems include, but are not limited to, Solaris (Sun Microsystems, Inc., Santa Clara, Calif.), Windows (Microsoft Corp., Redmond, Wash.), or Linux (Red Hat, Inc., Raleigh, N.C.). Appropriate software applications include, but are not limited to, relational databases such as Oracle 9.0.1 (9i) (Oracle Corp., Redwood Shores, Calif.), DB2 Universal Database V8.1 (IBM Corp., Armonk, N.Y.), or SQL Server 2000 (Microsoft Corp., Redmond, Wash.), and software for statistical analyses, such as packages available from SAS (SAS Institute, Inc., Cary, N.C.) or SPSS, Inc. (SPSS, Inc., Chicago, Ill.). In one embodiment, the server is the E420 workgroup server (Sun Microsystems, Inc., Santa Clara, Calif.), the operating system is Solaris (Sun Microsystems, Inc., Santa Clara, Calif.), the database software is Oracle 9.0.1 (9i) (Oracle Corp., Redwood Shores, Calif.), and the statistical software is from SAS (SAS Institute, Inc., Cary, N.C.).

[0104] Each research technology presents unique integration challenges. Some research technologies produce data that reside in-house within a research organization, while some research technologies produce data that are located externally on the Internet. Data may be stored in flat-files on a local file system, in relational databases, in object databases, or on web servers. Since there are very few accepted standards in the bioinformatics industry, file formats, database schemas, and software interfaces are highly varied and difficult to reconcile. Vocabulary and nomenclature are not exceptions to the lack of standards. It is not uncommon, for example, for a single gene to have multiple names in multiple contexts with no simple mechanism for mapping them together or distinguishing one from another.

[0105] It is useful in data integration to employ relational and object-oriented database design, data warehousing, federated database systems, normalized and de-normalized schema design, pre-processing, and other techniques to produce high-performance, highly extensible, data integration systems. One approach to addressing data integration is developing powerful and flexible software and database components to integrate and manage data generated from multiple sources. For example, a flexible combination of data warehousing and federated database systems is used to balance performance with flexibility in a rapidly changing environment.

[0106] Those skilled in the art can participate in the development and adoption of ontologies for life science research and help standardize the current widely disparate vocabularies. A standard vocabulary is very helpful, not only for integrating external sources of gene function data that can be used as part of an analysis, but also for representing the results of efforts to identify gene function. The nomenclature and ontology portion of the database of endogenous metabolites (FIG. 3) utilizes standardization efforts as applicable. Using the present invention, one skilled in the art can investigate and develop representations for modeling functional information that facilitates queries and inferences regarding gene function. Current laboratory information management systems (LIMS) can be expanded into all technologies so that data pertaining to a unique identifier is reliably tracked. Defining components in LIMS as the samples are processed vastly improves the efficiency by which data are integrated in comparison to component definition subsequent to data generation and storage.

[0107] The methods and systems of the present invention provide effective ways to manage large amounts of information as is required to create coherent data sets. In one embodiment of the present invention, a method for creating coherent data sets comprises an integrated data set containing disparate data, such as sequence data, gene expression data, metabolite data, and phenotype information.

[0108] A first step in processing disparate data is to create an inventory of types of information requiring integration. In addition to sequence data, gene expression data, metabolite data, and phenotype information, additional types of information include, but are not limited to, 3-D protein structural analysis, protein expression, biochemical pathways, genotypes (including polymorphisms), SNPs (including haplotypes), and scientific literature. The identification step involves working with scientists to determine the types of data that contribute to the knowledge of gene function. A second step in processing disparate data is identifying the specific sources of each type of information and the specific integration challenges for each. For example, one may determine that the GenBank database (National Center for Biotechnology Information, Bethesda, Md.), the SWISS-PROT database (European Bioinformatics Institute, Cambridge, UK), and an organization's in-house sequence repository are the key sources of sequence annotation data.

[0109] By implementing an embodiment of the present invention, one skilled in the art can then determine the location of the information and the technology necessary to access it. For example, GenBank and SWISS-PROT are available on the Internet and accessed through a World Wide Web connection, while an in-house sequence repository is usually located in-house, such as an in-house repository stored in a relational database on a central server. As such, in an alternate embodiment of the present invention, a set of components is utilized for downloading, processing, and storing GenBank and SWISS-PROT sequence data and annotations associated therewith. Specific data sources required to complete the process and locations of the same are determined by interviewing scientists and bioinformaticians, with ongoing efforts to remain current with the state-of-the-art.

[0110] Data integration systems of the present invention are designed to handle the types and sources of data that are identified in the first two steps as described above. For example, data warehousing, federated database management, text indexing, precomputation, and several innovative technologies are combined to form a robust, flexible, and powerful data integration system, comprising a third step of the present invention in processing a broad range of data from a plurality of sources. The third step utilizes an iterative design and review process whereby software engineers and scientists collaborate on the design of the system.

[0111] A fourth step in processing disparate data is the construction of a data integration system based on designs produced in the above-described steps. Construction involves implementing software and databases to fulfill specific requirements, typically specifications from software engineers, with support from project management and testing resources, as well as consultation from domain experts.

[0112] A fifth step in processing a broad range of data from a plurality of sources is the integration and representation of gene function data. The expressive power of vocabularies and ontologies currently in use within the scientific community is evaluated for describing gene function. Ontological terms are applied to the results of biological studies, such as site-of-action (SOA) studies, to determine whether the terms are expressive and exacting enough to describe the gene function data that is inferred from coherent data sets. An initial ontological assessment provides a starting point for a process of refining and standardizing a vocabulary of gene function that proceeds in iterative cycles throughout the duration of a project. At each iterative stage of refinement, the vocabulary is applied to integrate external sources of gene function data and gene functions identified by ongoing analysis of coherent data sets. The kinds of statements used to characterize gene function are based on the analysis of coherent data sets. Data representations developed for gene functions are used to query and apply the information produced.

[0113] The requirements for the LIMS employed with the integration of data for the present invention are carefully identified and implemented. LIMS are employed in most research organizations and are generally well-known in the art to facilitate data capture and storage, typically allowing the automation of many routine data management and processing tasks. Unfortunately, each research technology and data type usually has its own specific LIMS, and LIMS from different technologies do not communicate well with one another. Tools for integrating multiple technology-specific LIMS into a common framework include key components of the data integration system of the present invention. A suite of tools is developed by those skilled in the art for managing data coming from each type of LIMS, and modules are developed for moving data between the suite of tools. Data vehicle modules can validate data on both the sending and receiving sides, following common LIMS rules for sample handling throughout. Alerting mechanisms are provided to bring errors to a user's attention and to protect data integrity.
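
Validation on both the sending and receiving sides, with errors surfaced to the user, can be sketched as a pair of functions around a packaged transfer. This is a hypothetical sketch only: the required fields in `REQUIRED_FIELDS`, the use of a SHA-256 checksum to detect alteration in transit, and the function names `send` and `receive` are assumptions for illustration and are not described in the specification.

```python
import hashlib
import json

# Hypothetical sample-handling rule: every record must carry these fields.
REQUIRED_FIELDS = ("sample_id", "data_type", "value")


def validate(record):
    """Return a list of problems found in one record (empty if valid)."""
    return [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if record.get(name) is None
    ]


def checksum(records):
    """Digest of the records, used to detect changes during transfer."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def alert_if_invalid(records, side):
    """Raise an error naming each bad record so the user is alerted."""
    problems = [
        f"{side} record {index}: {problem}"
        for index, record in enumerate(records)
        for problem in validate(record)
    ]
    if problems:
        raise ValueError("; ".join(problems))


def send(records):
    """Sending side: validate, then package the records with a checksum."""
    alert_if_invalid(records, "outgoing")
    return {"records": records, "checksum": checksum(records)}


def receive(package):
    """Receiving side: confirm the checksum, then validate again."""
    if checksum(package["records"]) != package["checksum"]:
        raise ValueError("checksum mismatch: data altered in transit")
    alert_if_invalid(package["records"], "incoming")
    return package["records"]


package = send([{"sample_id": "S1", "data_type": "metabolite", "value": 5.5}])
print(receive(package))
```

Checking on both sides means a fault is attributed to the source LIMS, the transfer, or the destination, rather than being discovered later in an integrated data set.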

[0114] Once the data integration system is in place, the efficiency of the integrated data is measured. Two primary metrics are used to measure the efficiency of the data integration systems: 1) time savings provided to downstream users of the system by having integrated data versus working with the data in an unintegrated manner; and 2) the time required to integrate additional data sources into the system. Measuring the time savings from having integrated data requires a comparison between a user performing an operation in the integrated system versus performing the same operation on data that has not been integrated. In the unintegrated case, the user must look up all of the relevant information in each of the data sources individually, then integrate the information by manually entering it into a report or an analysis tool. If the number of data sources or the size of the data set is large, manual entry can be extremely time-consuming. Integration systems sold by a vendor, such as Lion bioscience, can reduce the effort required to pull together large amounts of disparate data by as much as several orders of magnitude. In some extreme cases, weeks of work in an unintegrated system can be reduced to mere minutes of work in an integrated system.

[0115] Manual integration of data from different technologies requires a great deal of effort, on the order of hundreds of hours for a relatively small experiment, and up to thousands of hours for a larger data set. Time required to integrate data is reduced dramatically by developing tools and data structures to efficiently integrate multiple data sources in a repeatable fashion. The time and effort required to integrate a new data source into the system is impacted by data source size, complexity, and similarity to previously integrated data sources. Larger data sets require more engineering effort to design a scalable solution, tune performance, and implement backup and recovery strategies than do small data sets. More complex data structures (such as sequence annotation) require a great deal more design work to integrate than do simple data structures or data structures which are fairly easy to reduce to a simple format (such as gene expression data). Finally, it is usually much more straightforward to integrate a new data source that is very similar in structure to a data source that has already been integrated, e.g., integrating sequence records from the EMBL database (European Molecular Biology Laboratory, Cambridge, UK) after GenBank sequence records have been integrated.

[0116] One aspect of the data integration system of the present invention is to enable integration of previously non-integrated data sources. The present invention provides a system that is fully scalable (i.e., handles a range of data sizes), handles complex data structures, and facilitates integration of new data sources similar to existing integrated data sources. User time required to integrate each new data source is then measured in operator-hours, taking into account the size, complexity, and similarity of the data source to existing integrated sources. Thus, the overall time required to integrate previously non-integrated data sources decreases over time in the integration system of the present invention.

[0117] Once the data are integrated, the creation of coherent data sets occurs. A coherent data set is an integrated data set that is transformed through a series of protocols and statistical analytical methods to create a comprehensive data set. Consequently, data from multiple indicators of biological status are compared to one another and analyzed using the same tools or suite of tools. A coherent data set (or group of coherent data sets) creates a biologically relevant, virtual map of cellular processes. Coherent data sets are vastly more informative than integrated data from individual data streams for identifying gene function and other leads for product development.

[0118] In one embodiment of the invention, a biological system is perturbed and the effects of that perturbation are characterized as completely as possible. To quantify the changes due to the perturbation, all measurements are compared to corresponding data from experimental controls (the baseline or reference). In any biological experiment, measurements reflect the sum of several types of variation. Variation may be due to natural biological variation, experimental process variation, and variation that is a result of the perturbation of the system that is the focus of the experiment. A baseline is a profile of measurements associated with a control. Use of the baseline is necessary to account for variation due to an intentional perturbation of the system and its precise inflection or deflection from the control.

[0119] To establish a baseline, sufficient control experiments are carried out to provide an understanding of the biological and experimental variation inherent in the technology. Establishing a baseline, that is, collecting data from control experiments that correspond to all types of measurements taken, makes it possible to transform all kinds of data formats to a common presentation. At a basic level, a coherent data set consists of a set of measurements that have all been standardized to a common (or commonly relevant) baseline. For example, all measurements could be expressed as a number of standard deviations above or below the mean of a baseline control. Establishing a baseline for each type of measurement makes it possible to weight each measurement with an appropriate level of sensitivity. That is, if the control shows very little variation for a particular type of measurement, then a relatively small difference in that measurement type can be significant. If the control varies widely for a particular type of measurement, then only relatively large differences in that measurement type may be significant.
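
By way of illustration only, the standardization described above (expressing each measurement as a number of standard deviations above or below the mean of a baseline control) can be sketched as follows. The function name `standardize_to_baseline` and the sample values are hypothetical and are not taken from the specification; the sketch assumes the sample standard deviation of the controls is the measure of baseline variation.

```python
from statistics import mean, stdev

def standardize_to_baseline(measurements, control_values):
    """Express each measurement as standard deviations from the control mean.

    Assumption: baseline variation is summarized by the sample standard
    deviation of the control values (hypothetical choice for this sketch).
    """
    mu = mean(control_values)
    sigma = stdev(control_values)
    if sigma == 0:
        raise ValueError("control shows no variation; cannot standardize")
    return [(x - mu) / sigma for x in measurements]

# Hypothetical control and treated readings for one measurement type.
controls = [10.0, 12.0, 11.0, 9.0, 13.0]
treated = [11.0, 15.0, 7.0]
scores = standardize_to_baseline(treated, controls)
```

A control with little variation yields a small denominator, so a small raw difference produces a large score, which is the sensitivity weighting the paragraph describes.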

[0120] The prerequisites for creating a coherent data set are integrated data and a baseline, or standard, for each measurement type. In a research technology wherein data are collected for long periods of time (i.e., years), each set of baseline data potentially may possess different distributional parameters. That is, due to inevitable changes in any number of factors (growth environment, laboratory practices, raw materials, etc.), a plant grown during one period may not be directly comparable to a plant grown a year prior to that period or, alternatively, a plant grown a year following that period. Therefore, strict guidelines are implemented to provide quality control within baseline measurements and to maintain the integrity of the baseline.

[0121] Methods and systems of the present invention were used to create a coherent data set with a relatively small but reasonably complex integrated data set from a herbicide SOA experiment in which 18 compounds were examined. After validating coherence for the SOA data set, it was expanded and coherence was reestablished, and a larger and more complex integrated data set describing 65 mutants (functional gene knock-out data) in Arabidopsis was added to the SOA. After establishing coherence for the expanded data set, the process was scaled and applied to even larger data sets that describe 600 or more Arabidopsis mutants. The process for developing coherence for each integrated data set is largely iterative, so that with each new project, the creation of coherent data sets becomes increasingly straightforward.

An Integrated Data Set

[0122] Initially, integrated data from a small, well-defined compound (herbicide) site of action (SOA) experiment in Arabidopsis was used, as mentioned above. The integrated data comes from three data streams: gene expression analysis (GEA), phenotypic analysis, and metabolite analysis. Several of the tasks relating to the creation and testing of a coherent data set are repeated using larger and more complex data sets as more data and information become available. The creation and testing cycle is an iterative process.

[0123] Following the establishment of a baseline, methods are developed and automated to monitor changes in the baseline. Monitoring methods are similar to some types of automated quality controls that detect changes in the location or variation of a response. One skilled in the art can begin monitoring changes in the baseline by adapting quality control methods and exploring their suitability. Ideally, baseline-monitoring methods are largely data-driven. Alternatively, one can explore the use of methods based on external data (e.g., data from a temperature monitor, or from a LIMS) that may indicate or identify baseline shift. In addition, one can utilize an algorithm for estimating the size of “windows” of data that share a common and stable baseline. Such an algorithm is useful in planning budgets for laboratory procedures.
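
As a minimal sketch of one data-driven monitoring method, the following adapts a simple control-chart style rule: new control readings are flagged when they fall more than k standard deviations from the mean of a reference baseline. The rule, the default k of 3, and the name `flag_baseline_shifts` are assumptions chosen for illustration; the specification does not prescribe a particular rule.

```python
from statistics import mean, stdev

def flag_baseline_shifts(reference, new_controls, k=3.0):
    """Return indices of new control readings outside mean +/- k*SD.

    Assumption: a single k-sigma limit rule, in the style of a basic
    quality control chart (hypothetical choice for this sketch).
    """
    mu = mean(reference)
    sigma = stdev(reference)
    return [i for i, x in enumerate(new_controls) if abs(x - mu) > k * sigma]

# Hypothetical reference baseline and later control readings.
reference = [10.0, 12.0, 11.0, 9.0, 13.0]
new_controls = [11.5, 10.2, 16.5, 11.0]
flagged = flag_baseline_shifts(reference, new_controls)
```

Flagged readings could then be cross-checked against external records, such as a temperature monitor or a LIMS, before a baseline shift is declared.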

[0124] Standard quality control measures in combination with a variety of decision rules are evaluated, process error rates are compared, and minimum sets of decision rules are developed. A number of commonly used rule sets are used. However, the false-positive and false-negative error rates of all rule sets work against each other. That is, if the rule set is larger than necessary, then (even if every rule is sound if used independently) the result can be an inflated false-positive error rate. Thus, the optimization of the rule set is performed by statisticians who can develop custom rule sets as needed.
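
The inflation of the false-positive rate with rule-set size can be illustrated with a short calculation. The sketch below assumes the rules fire independently, which is a simplification made only for illustration (real decision rules applied to the same data are usually correlated); the name `combined_false_positive_rate` is hypothetical.

```python
def combined_false_positive_rate(per_rule_rates):
    """Probability that at least one rule raises a false alarm.

    Assumption: rules are statistically independent, so the chance that
    no rule fires is the product of the individual (1 - rate) terms.
    """
    p_no_alarm = 1.0
    for rate in per_rule_rates:
        p_no_alarm *= (1.0 - rate)
    return 1.0 - p_no_alarm

# One rule at a 5 percent false-positive rate versus four such rules.
single = combined_false_positive_rate([0.05])
four_rules = combined_false_positive_rate([0.05] * 4)
```

Under this assumption, four individually sound rules together raise a false alarm roughly 18.5 percent of the time, which is why a minimal rule set is preferred.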

[0125] Historical, known changes in a research technology are used to test the rule sets and to assess the process error rates. During development, many documented systematic changes are typically made to a research technology. A number of changes can affect the output of research technologies. This information can be used to test rule sets and assess their process error rates. For example, by developing a hybrid system that considers quality control-like decisions, but also uses external information about the laboratory procedures to make decisions, a system can determine whether it performs its function more robustly. A purely data-based decision system can be improved by utilizing information about changes in suppliers, materials, laboratory procedures, or the like. Development and testing of data-based methods for estimating “window size” for a stable baseline is also a useful approach.

[0126] Each quality control step is computationally intense. To address problems efficiently, the prototype data set is kept small, and the dependent variables are screened to locate a small set that is known to be sensitive to changes in the experimental environment. Once a promising strategy is developed, it is tested and validated for the next, larger set of dependent variables.

Processing Integrated SOA Data: Toward Coherence

[0127] Each data measurement collected is standardized to a control or reference. If no matched control exists, then a similar control is substituted, the experiment repeated, or the data excluded. Data can be selected for comparability to compound concentration and response times according to baseline experiments. Using this data set, automated methods for standardizing data are developed. In one embodiment, algorithms are explored for transforming data to approximate normality and/or common variance before standardizing. In another embodiment, distribution-free methods for expressing measurements on a common scale are also explored. Such distribution-free methods are widely applicable because they do not depend on normality, constant variance, or other assumptions that may or may not hold true for a given set of data derived under process conditions that are monitored and evaluated against established process error models.

[0128] Standard algorithms are developed for transforming data to normality with constant variance. In theory, any distribution can be transformed to a normal, or Gaussian, distribution. In practice, and for a given set of data, finding the right transformation can be challenging. Computer algorithms exist for suggesting an appropriate transformation. Algorithms also exist for suggesting a variance-stabilizing transformation. Sometimes these two transformations are the same (or similar), while in other instances a transformation that solves one problem makes the other worse. On the other hand, one of a small number of transformations often helps greatly, even though it may not be the “analytically correct” choice. Such transformations are assessed for their effectiveness in managing process variation, their efficiency in computer processing time, and their effect on the informative value derived from the inherent biological variation in the system.

[0129] Distribution-free methods are assessed for expressing data on a common scale. Distribution-free methods based on ranks, medians, or interquartile ranges are commonly used, and are often found to be nearly as powerful as standard methods while being applicable to a wider variety of data types. The two-sample location and dispersion tests suggest methods for adjusting data sets to a common location and/or spread. In addition, the usual standardization techniques are adaptable to more robust statistics (such as the median and interquartile range) in a statistically sound manner. Small integrated data sets are readily developed through the use of these methods. The integrated data set is screened and a few variables are chosen that are clearly non-normal and have non-constant variances. By focusing on a small set of “least favorable” variables, the quickest and most robust results are achieved. Methods developed in this way that show promise are tested and verified on a larger variable set.
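
One way the usual standardization can be adapted to robust statistics is to replace the mean with the median and the standard deviation with the interquartile range. The following is a sketch under that assumption; the name `robust_standardize` and the sample values are hypothetical, and the quartiles are computed with the default (exclusive) method of Python's `statistics.quantiles`.

```python
from statistics import median, quantiles

def robust_standardize(measurements, control_values):
    """Express measurements as (x - control median) / control IQR.

    Assumption: median and interquartile range stand in for the mean and
    standard deviation; no normality or constant variance is assumed.
    """
    center = median(control_values)
    q1, _, q3 = quantiles(control_values, n=4)
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("control interquartile range is zero")
    return [(x - center) / iqr for x in measurements]

# Hypothetical control and treated readings for one measurement type.
controls = [9.0, 10.0, 11.0, 12.0, 13.0]
treated = [14.0, 8.0, 11.0]
robust_scores = robust_standardize(treated, controls)
```

Because the median and interquartile range are insensitive to a few extreme control values, this scale is less affected by outliers than a mean and standard deviation scale.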

[0130] Data that are not normally distributed can be transformed to a normal or Gaussian distribution. For example, GEA and metabolite analysis data are not normally distributed, but appear much more so after being converted to a logarithmic scale. The conversion step is important in that many statistical analyses behave more reliably on normally distributed data. A caveat to conversion is that some data sets may not be readily transformed to a normal distribution. In such cases, “robust” analysis methods are used that do not rely on an assumption of normality, and may work reasonably well even if the data set is not normally distributed. Key characteristics of a coherent data set are whether the data can be transformed to normality and whether assumptions of normality will be necessary.

[0131] Values are assigned to all potentially valuable data measurements. Metabolite analysis and GEA technologies have upper and lower limits of detection. Ordinarily, if a data point falls outside of the limit, then no value is assigned. To avoid the loss of data and to create a more representative data set, values are assigned in cases where a data point falls outside of a predetermined limit. Compounds with known sites of action help reveal whether the assignments are appropriate, and modifications are made accordingly.
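
The specification does not state which value is assigned to an out-of-range data point. The sketch below assumes, purely for illustration, that the nearest detection limit is substituted and that the instrument reports out-of-range readings as the flags "below" or "above"; the function name, the flags, and the limits are all hypothetical.

```python
def assign_limit_value(reading, lower_limit, upper_limit):
    """Return a numeric value for a reading.

    Assumption: a reading flagged "below" or "above" is replaced by the
    corresponding detection limit instead of being dropped.
    """
    if reading == "below":
        return lower_limit
    if reading == "above":
        return upper_limit
    return float(reading)

# Hypothetical readings from one metabolite channel.
readings = [0.8, "below", 42.0, "above"]
assigned = [
    assign_limit_value(r, lower_limit=0.5, upper_limit=100.0) for r in readings
]
```

Other substitution schemes are possible; whichever is used can be checked against compounds with known sites of action, as the paragraph describes.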

[0132] Selection of significant data depends on the amount of variability in the baseline control. In the herbicide SOA experiments, data that did not differ from the standard by at least two standard deviations (corresponding to approximately a 95 percent probability based on a normal distribution) were excluded. The determination of what data are considered to be significant can be changed and tested empirically for any given data set.
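
Given measurements already expressed in baseline standard deviations, the selection step can be sketched as a simple filter. The name `select_significant`, the variable labels, and the scores are hypothetical; the threshold defaults to the two standard deviations used in the herbicide SOA experiments and is left adjustable because the criterion can be tuned empirically.

```python
def select_significant(z_scores, threshold=2.0):
    """Keep variables whose standardized value is at least `threshold`
    standard deviations from the baseline, in either direction."""
    return {
        name: z for name, z in z_scores.items() if abs(z) >= threshold
    }

# Hypothetical standardized measurements for one treatment.
z_scores = {"gene_a": 0.4, "gene_b": -2.7, "metab_c": 2.0, "trait_d": 1.9}
selected = select_significant(z_scores)
```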

[0133] To establish coherent data, a degree of confidence is required that data from all technologies contribute to an appropriate extent. Quantitative discrepancies of data from each technology are weighted to ensure adequately reflective analyses. In a human genomics study, GEA can provide data for all (estimated) 35,000 genes, and state-of-the-art technology in metabolite analysis could provide data for up to 500 or more metabolites. The significant quantitative differences in the amount of data generated from different technologies are accounted for to ensure that possible qualitative variations do not adversely affect coherence.
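
The specification does not give a weighting formula. One simple scheme, assumed here only for illustration, gives every technology the same total weight and divides that weight evenly among its variables, so that 35,000 gene measurements do not overwhelm 500 metabolite measurements. The name `balanced_weights` and the technology labels are hypothetical.

```python
def balanced_weights(variables_per_technology):
    """Per-variable weight for each technology.

    Assumption: each technology receives an equal share of the total
    weight, split evenly across that technology's variables.
    """
    n_tech = len(variables_per_technology)
    weights = {}
    for tech, n_vars in variables_per_technology.items():
        if n_vars <= 0:
            raise ValueError(f"{tech} has no variables")
        weights[tech] = 1.0 / (n_tech * n_vars)
    return weights

# Variable counts taken from the human genomics example in the text.
counts = {"gene_expression": 35000, "metabolite": 500}
weights = balanced_weights(counts)
totals = {tech: weights[tech] * counts[tech] for tech in counts}
```

Under this scheme each metabolite variable carries 70 times the weight of each gene variable, and both technologies contribute half of the total.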

[0134] Data are assayed for coherence. The data are analyzed using a variety of multivariate analyses, applied appropriately by one skilled in the art. For example, the compounds are clustered based on the phenotypic data, and then are reviewed to determine whether they exhibit similar profiles when viewed in light of multicomponent metabolite analysis data and/or gene expression data.

[0135] Several statistical methods are used to test a coherent data set. For example, cluster analysis is performed and hypotheses formulated based on the results of the clustering. A well-designed cluster analysis can provide information leading to the identification of gene function, as genes that cluster together in this type of analysis may be inferred to have similar function. FIG. 6 illustrates an example of cluster analysis performed on phenotypic data. Additional analyses can be carried out to determine whether the hypotheses are valid. In one embodiment, a statistician visually evaluates cluster analyses and evaluates whether a coherent data set yields an expected result. If the result is inconsistent with that which is expected, each of the process steps is reevaluated.

[0136] If the results of the various analyses are consistent with expectations, a score is derived based on how close to ideal (normally distributed with constant variance) the data set is. This is taken under consideration together with a score that reflects the size and complexity of the data set. These scores make it possible to follow the progress of coherent data set development.

[0137] Once a coherent data set is established and validated, more information can be added and the set re-validated in an iterative process. For example, in the herbicide SOA experiment, the baseline was expanded by adding 100 additional compounds with known sites of action. The data was expanded by adding similar data from a different organism, for example a microbe. Data corresponding to the effect of the above-referenced 18 compounds on one or more microbes was provided as a useful data set for creating and testing coherence.

[0138] In one embodiment of the invention, a second integrated data set is used to create a coherent data set describing, for example, 65 Arabidopsis mutants with functional gene knock-outs. The data are from three data streams/biological indicators: gene sequencing and annotation, metabolite analysis, and phenotypic analysis. The larger data set is processed through one embodiment of the methods of the invention, that is, the data are standardized, transformed to a Gaussian distribution, numerical values are assigned, significant data are selected, and the data are weighted, or balanced. As with the smaller herbicide SOA data set, the data from the 65 mutants are then assayed for coherence by applying multivariate analyses and predictions, additional analyses are performed, hypotheses are validated, and coherence score and metrics are calculated.

[0139] Methods of the invention are scalable for creating and testing coherent data sets. Scaling includes repeating all of the methods of the invention described above for a larger integrated data set. For example, an integrated data set with 600 gene knock-out mutants is suitable as a large data set. In a particular embodiment, the data are from three different technologies: sequencing and annotation, metabolite analysis, and phenotypic analysis. In addition, other data sets and improved methods for integrating data are available to use in combination with the 600 gene knock-out mutants, creating an even larger data set. Preferably, most of the work to create coherent data sets is automated to produce a first-pass coherent data set that is reviewed through a user interface by a statistician who can input refinements to the process.

[0140] The methods of the present invention further include multiple computational and analysis steps for producing a coherent data set. A number of analysis tools are developed or adapted for use in specific research technologies, including a standard suite of sequence analysis and comparison tools, such as, but not limited to, BLAST, Smith-Waterman, and Hidden Markov Model (HMM) searches. In addition, a standard suite of sequence analysis and comparison tools will likely include an open reading frame (ORF) prediction program called ESTscan. For metabolite analysis, there is Target DB (Thermo Electron Corp., Waltham, Mass.), a chromatographic database and analysis tool that houses data on metabolite levels in plant tissues, performs automated quality control on the data, and aids in identifying unknown compounds. Additional analysis tools can be written using SAS (Statistical Analysis Software, SAS Institute, Cary, N.C.) to perform additional and more sophisticated analyses (such as discriminant analyses) and 2-D and 3-D visualization of metabolite analysis data.

[0141] There are also a number of SAS modules that operate on phenotypic data. These modules perform automated quality control and provide visualization for numeric and descriptive phenomic measurements. In addition, a number of SAS modules are developed that perform a variety of multivariate analyses and present tools for data visualization. These modules include a principal components and factor analysis module; a phenomic clustering module; and a discriminant analysis module, for applications, for example, to a plant phenotyping process. Other tools and databases are available for sequence, genetic, and gene expression information. Expertise is useful for integrating public domain and commercial analytic and visualization tools with open, extensible integration systems.

[0142] In theory, analysis of a coherent data set should provide new information not available by separate analysis of the individual data streams that contributed to the coherent data set. However, in creating a coherent data set, a multidimensional space is defined that is not optimal for analysis. One of the most daunting problems that must be considered when designing the analyses is the multidimensionality of a coherent data set. That is, as the number of dimensions (data streams) increases, the data that populates that “data-space” becomes increasingly sparse. This situation makes it difficult to draw relevant conclusions from cluster or other types of analyses. There are two simple approaches to solving this problem: increase the amount of data collected to populate the space, or find ways to reduce the dimensionality of the data to obtain relevant results from analyses. In practice, increasing the amount of data is often not economically viable, so the preferred approach in many cases may be to reduce the dimensionality without losing information.

[0143] In one embodiment of the present invention, the dimensionality is reduced by selecting certain data sets for “pre-treatment,” for example, by calculating the correlation between complex profiles and then using the correlative data rather than individual profiles in further analyses. Technology-specific analysis tools are commercially available, but considerable effort is required to manipulate the output from any one tool and use it as the input to an unrelated tool without corrupting the data. For example, even when both tools are written in SAS, different software modules often require that data be in very different formats. Furthermore, users trained to operate the analysis tools are typically limited to bioinformaticists and biostatisticians, and domain scientists rarely have access to the modules or the appropriate training. Finally, very little is known about the most effective ways to present and display highly multivariate results.
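
The “pre-treatment” by correlation can be illustrated with a short sketch: each long profile is replaced by its correlation with every other profile, so a treatment measured over thousands of variables is summarized by one number per comparison. The sketch assumes Pearson correlation, which the specification does not specify; the function names and the three short profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def pairwise_correlations(profiles):
    """Map each pair of profile names to the correlation of their values."""
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for a, b in combinations(sorted(profiles), 2)
    }

# Hypothetical standardized response profiles for three compounds.
profiles = {
    "compound_1": [1.0, 2.0, 3.0, 4.0],
    "compound_2": [2.0, 4.0, 6.0, 8.0],
    "compound_3": [4.0, 3.0, 2.0, 1.0],
}
correlations = pairwise_correlations(profiles)
```

The resulting table has one value per pair of compounds regardless of how many variables each profile contains, which is what makes it a more tractable input for later analyses.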

[0144] Gene function technology tools used in the methods of the present invention are preferably designed as modules. A research scientist can request an analysis without having to specify the format of the input data. Preferably, the tools are visual, and whenever possible, analysis results are presented in graphical forms that are easy for non-statisticians to understand. Also, it is preferred that the tools are interactive. If a scientist identifies an interesting set of data points, he/she can query the data set for more information on the points of interest, and define a permanent “research set” for the queried data points, providing an opportunity to return to the research set for further analysis in another session.

[0145] Similarly, but on a larger scale, the definition of a useful pipeline of analyses can be archived for future re-use and analysis. With the availability of flexible analysis tools, a scientist can visualize and analyze coherent data sets and form hypotheses directed to gene function. The process of developing coherent data sets by employing the methods of the present invention facilitates gene function hypothesis formation by making data available in standard formats. In addition, data architects can determine standard storage architectures for input and output data, so that output from one tool can easily be used as input to another. A software engineering team can work with domain scientists and statisticians to develop user interfaces. The most challenging data display can yield a huge amount of information to an educated user. In such situations, one can address and interpret information using visualized multivariate data, as developed by domain scientists, statisticians, and engineers with expertise in visualization and computer-human interaction. Data analysis and management developmental processes can involve trial-and-error approaches as different visualization methods are examined and modified, prior to the derivation and adoption of solutions that are statistically sound and intuitively appealing.

[0146] To fully understand and utilize coherent data sets, tools and methods for predicting gene function (or compound site of action) are required. Such tools and methods entail reiterative development tasks that are developed using validated coherent data sets. Data in coherent data sets tend to be highly multidimensional. For example, even the smallest data set described herein represents 18 herbicide treatments for which samples are collected at three time points. For each sample, responses are measured for approximately 6000 genes, approximately 250 compounds, and about a dozen morphometric, or phenotypic, traits. Data dimensionality is reduced to determine an optimal degree of reduction. Dimension reduction is done via data pre-clustering, correlation analysis, principal components analysis, or regression analysis. Aggressive dimension reduction leads to a much smaller and more tractable data set, but there is a caveat that biologically relevant detail could be lost. Thus, some experimentation is useful to determine which data can be reduced without a loss in statistically verified quality.

[0147] Following a reduction in data dimensionality, patterns and similarities are identified. A number of multivariate analysis tools are employed, such as, but not limited to, factor analysis, principal components analysis, cluster analysis, and discriminant analysis to identify patterns or similarities among the compounds (herbicides, for example) or genes (knock-outs, for example). Research scientists evaluate specific combinations of data and tools that are most informative with respect to identification of gene function. Different views of multidimensional data enable the research scientist to develop insights and formulate hypotheses directed to the relatedness of data. FIG. 7 shows an example of a tool that allows quick visualization of normalized data with respect to the baseline. FIG. 8 is an example of visualization of a two-dimensional comparison of data from two different technologies. FIG. 9 shows different perspectives of data made by using a three-dimensional visualization tool and illustrates the value of looking at complex data in a three-dimensional format. FIG. 9 parts A and B illustrate two different three-dimensional views of the same data set. Note that in FIG. 9A the data appear to fall into two discrete groups, but if the figure is turned in three-dimensional space and viewed from a different side (FIG. 9B), the data no longer appear to be in only two groups. FIG. 9 is illustrative of the fact that data from complex systems and/or complex data sets can become overly simplified and thus misleading when viewed in only two dimensions. FIGS. 7 through 9 provide examples of how complex data are visualized. In the embodiment illustrated in FIGS. 7-9, the data sets shown are from gene expression analysis, phenotypic analysis, and metabolite analysis. However, data could be from any combination of technologies or data types.

[0148] The use of the present invention in analyzing complex data sets allows the formation of decision trees leading to hypotheses of gene function or site of action. Based on identified patterns, decision trees are derived to predict gene function or compound site of action. FIG. 10 illustrates one embodiment of the present invention demonstrating the creation and use of a coherent data set, in which hypotheses are formed and tested by laboratory experiments. In the case of the herbicide site of action (SOA) data set (Specific Example 2, infra), experimental results from compounds (herbicides) with known sites of action are used to test and refine the multivariate models. Using models that classify known herbicides with a high degree of accuracy, predictions are made with respect to herbicides having unknown sites of action. Predictions are validated in the laboratory, and the results (both positive and negative) are used to further refine predictive models. Similarly, for the gene knock-out experiments, data for genes of known function are used to generate predictive models. As part of the iterative process, if predictions for compounds with known site of action, or genes with known function are unreliable, then each step of the methodology from which the prediction is formed is reviewed and re-evaluated.

[0149] Criteria are established for selecting high-confidence predictions, and for calculating the extent to which high confidence predictions are produced as a percentage of a data set. Validated predictions formed by the methods of the present invention undergo further validation in a laboratory. Although time consuming, the results of laboratory validation studies enable the calculation of predictive success rate, further enabling monitoring of improvement in the quality of analytical tools.

[0150] In one embodiment of the present invention, a high-throughput system is used for applying methods of the invention to an analysis of complex disparate data. A high-throughput system for identifying gene function preferably utilizes automation of tools and methods for building predictive models. Automating and generalizing predictive modeling is possible following verification that the logic and analysis tools used to generate predictions are performing accurately. Developing and automating the tools is a reiterative process. Guidelines are developed for choosing analysis tools for different scenarios and for diagnosing potential problems. In addition, semi-automated gene function analysis tools provide higher degrees of access to complex data than that currently available in the art.

[0151] All predictions based on a coherent data set model are tested in a laboratory. From the herbicide SOA data set, unknown compounds with high-confidence predictions of site of action are subsequently validated. With the addition of data sets which characterize gene knock-out mutants, predictions of gene function are made. The particular approaches used to test predictions of site of action or gene function are identified and implemented with assistance from domain experts.

Creation of an Integrated Data Set

[0152] In one embodiment of the present invention, three integrated data sets were generated, each with increasing size and complexity. The first and simplest integrated data set was generated from a site of action (SOA) experiment (hereinafter SOA1) that evaluated the effects of 18 compounds (herbicides) on Arabidopsis. The site of action is known for some of the 18 compounds. For two of the compounds, the mode of action at the site of action is also known. SOA experiments are commonly performed, since identification of the site of action is often sufficient knowledge for product development, even if the mode of action has not been determined. The 18 commercially available herbicides used in SOA1 represented nine known sites of action and one unknown site of action. In some cases, different chemical classes of herbicides affecting a common site of action were used. For each herbicide, a series of dose response curves was generated and a time course for symptom development was established. Plant tissue was sampled at 3 stages (early, middle and late) in symptom development. Sufficient mock-treated control plants were used at each sample stage to establish a baseline for each technology type. Data for the SOA1 experiment were collected from three different technologies: gene expression analysis, metabolite analysis, and phenotypic analysis, which provided a total of approximately 50,000 data points.

[0153] A larger integrated data set was generated for data corresponding to 65 Arabidopsis mutants that were functional gene knock-outs (hereinafter GKO1). Data for the GKO1 experiment came from three different technology types: sequencing and annotation, metabolite analysis, and phenotypic analysis. The GKO1 data set contained approximately 300,000 data points. Challenges were encountered in integrating the GKO1 data set. The data was stored in a variety of formats from several different technologies and utilized domain-expert screening for quality control. Data architects, working in conjunction with biostatisticians and laboratory scientists within each technology, designed an integrated database schema capable of handling data from the different technologies. The schema was normalized so that all information related to a particular mutant could be easily retrieved. Faced with highly heterogeneous sets of input data, bioinformaticists wrote custom conversion programs to populate the database. Software engineers worked with laboratory scientists and biostatisticians to build an interactive quality control module that allowed domain scientists to query the database for a mutant, to view graphs of pertinent characteristics, and to remove low quality data. In addition, some parts of the quality control effort were fully automated. These modules enabled unusually rapid and complete quality screening of a very large set of data.

[0154] The challenges of integrating the collection of GKO1 data were overcome by a team with knowledge in database architecture, design, and implementation; data processing and conversion; statistics and data visualization; and software engineering and human-computer interaction. A view of an integrated data set for a single gene (or compound) is shown in FIG. 4. Referring now to FIG. 4, a Gene ID (a unique identifier) is linked to data from sequence and annotation (annotation; DNA indicator), metabolite or biochemical analysis (BCP; metabolite indicator), gene expression analysis (GDP; RNA indicator), and phenotypic analysis (phenotype indicator).

[0155] The largest integrated data set generated (hereinafter GKO2) corresponds to 600 Arabidopsis mutants that are functional gene knock-outs. Data for the GKO2 experiment were obtained from three different technology types: sequencing and annotation, metabolite analysis, and phenotypic analysis. The GKO2 data set contained approximately 3.5 million data points. Implementing batch processing wherever possible improved the efficiency of integrating the GKO2 data.

[0156] The FUNCTIONFINDER system is used in the acquisition and storage of data. The organization of FUNCTIONFINDER is shown in FIG. 5. FUNCTIONFINDER comprises four interrelated components: databases, data processing, data analysis tools, and user interfaces. Data are extracted from the Refinery layer (REFN) and integrated in the Abstraction layer (ABST). Public databases and other sources of data are integrated in the Abstraction layer with any proprietary data or data generated “in-house.” Integrated data are used to generate coherent data, which are stored in a relational database and subsequently extracted into coherent data sets for efficient access by Discovery layer (DISC) tools.

[0157] Data are produced on a variety of instruments, and initial storage is in a variety of media, such as proprietary databases, LIMS, flat files, Excel spreadsheets, and the like. In the methods of the present invention, all generated data are loaded into an integrated database. A Refinery database can contain data related to soil samples, such as experimental plants grown in a flat (container) of soil. Data collected on the soil samples are stored in a Laboratory Information Management System (LIMS). To populate the Refinery, a computer program copies information from LIMS into the Refinery. Data about a mutated gene in the experimental transgenic plants are stored in a separate proprietary database. To further populate the Refinery, another computer program copies information from the proprietary database to the Refinery Database. Integrity checking and enforcement take place as the data are loaded, ensuring that all data in the database are integrated, i.e., identified and linked to all associated data. Data in the Refinery are associated with a measurement set, a collection of measurements all related to one experiment. Enforcing data integrity ensures that each data point is correctly associated with a measurement set. The integrated database stores data in a tree-like structure, so that a measurement can be linked to other measurements further up the tree, and measurements further down the tree can be linked to it. Integrity checking ensures that all upward links are present and valid when a data point is stored.

[0158] Sample identification (ID) is essential to the methods and systems of the present invention. To obtain truly integrated data, each sample must have a unique identifier that allows it to be linked with all data acquired from that sample. For example, in the herbicide SOA experiment, samples were derived from Arabidopsis plant tissue. Each Arabidopsis transgenic construct is made of two plasmid parts, a driver and a target, and the construct entry has references to the identity of the driver and target used. When a construct is added to the list, integrity checking ensures that the Target Plasmid ID and Driver Plasmid ID both refer to plasmids that are already in the list. If not, the entry is rejected. The mutant plants are grown in flats. Each flat set that is planted uses experimental (mutant) plants from a single construct. The flat set entry contains a reference to the Construct ID that is planted. When a flat set is added to the list, integrity checking ensures that the Construct ID refers to a construct that is already in the list. If not, the entry is rejected.
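
The integrity rules described above can be illustrated with a minimal sketch. The class and method names below (`Registry`, `add_construct`, `add_flat_set`) and the in-memory storage are assumptions made for illustration only; the Refinery Database itself is not described at this level of detail.

```python
class IntegrityError(ValueError):
    """Raised when an entry refers to a parent record that is not yet stored."""


class Registry:
    """Hypothetical in-memory stand-in for the plasmid, construct, and flat set lists."""

    def __init__(self):
        self.plasmids = set()
        self.constructs = {}  # construct_id -> (driver_id, target_id)
        self.flat_sets = {}   # flat_set_id -> construct_id

    def add_plasmid(self, plasmid_id):
        self.plasmids.add(plasmid_id)

    def add_construct(self, construct_id, driver_id, target_id):
        # Both the driver and the target plasmid must already be in the list.
        for plasmid_id in (driver_id, target_id):
            if plasmid_id not in self.plasmids:
                raise IntegrityError(f"unknown plasmid: {plasmid_id}")
        self.constructs[construct_id] = (driver_id, target_id)

    def add_flat_set(self, flat_set_id, construct_id):
        # The planted construct must already be in the list.
        if construct_id not in self.constructs:
            raise IntegrityError(f"unknown construct: {construct_id}")
        self.flat_sets[flat_set_id] = construct_id
```

In this sketch a rejected entry raises an exception and leaves the lists unchanged, which corresponds to the "entry is rejected" behavior described above.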

[0159] When data are acquired, they are fed directly into the Refinery Database. Data in the Refinery Database are subjected to a number of quality checks to ensure that the data used in later calculations are accurate and consistent. In the example of the herbicide SOA experiment in Arabidopsis plants, the number of rosette leaves is counted and recorded on each even-numbered day from Day 14 (after planting) until the first flower buds are observed on the plant. Throughout this observation period, the number of rosette leaves should be a non-decreasing sequence, such as is characterized in Table 1.

TABLE 1
Day 14    Day 16    Day 18    Day 20    Day 22
0         2         2         4         6

[0160] If the number entered on Day 20 were “8,” it would indicate that a mistake was made in the data entry or data observation, because the count would then fall from 8 to 6 between Day 20 and Day 22. A data quality check relies on examination of the entire sequence of measurements: a value of 8 rosette leaves on Day 20 may be perfectly reasonable by itself, but is clearly an error in the context of the other measurements.
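
A sequence check of this kind can be sketched as follows. The function name `decreasing_steps` is hypothetical, and the sketch only reports where a decrease occurs; how the offending value is then flagged is left open.

```python
def decreasing_steps(days, counts):
    """Return (earlier_day, later_day) pairs where the leaf count goes down."""
    return [
        (days[i], days[i + 1])
        for i in range(len(counts) - 1)
        if counts[i + 1] < counts[i]
    ]


days = [14, 16, 18, 20, 22]
table_1 = [0, 2, 2, 4, 6]      # values from Table 1
with_error = [0, 2, 2, 8, 6]   # Day 20 entered as 8 instead of 4

print(decreasing_steps(days, table_1))     # no decreases are reported
print(decreasing_steps(days, with_error))  # the Day 20 to Day 22 step is reported
```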

[0161] An example of another type of data that could be used in the creation of integrated data and, ultimately, coherent data sets, is the measurement set collected for flower production in Arabidopsis. The day on which flower production started, the day on which flower production stopped, and the day on which seeds were harvested are all recorded. The day on which flower production stopped must be later than the day on which it started, and must be earlier than the day on which seeds were harvested. If a data point is chronologically outside the pattern, it can be inferred that one of the recorded values is in error, although it cannot always be inferred which recorded value is wrong. Data points that are clearly in error (as in the example for rosette leaves) are flagged as erroneous data points in the Refinery Database so that they will not be used in future calculations and conclusions. Data points that may be in error (as in the flower production example) are flagged as questionable data points in the Refinery Database. Depending on the application, future calculations may or may not use flagged observations.
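
The chronological check for flower production can be sketched in the same spirit. The `Flag` values and the function name `check_flowering` are illustrative assumptions; the sketch marks the whole measurement set as questionable because, as noted above, the check alone cannot tell which of the three recorded days is wrong.

```python
from enum import Enum


class Flag(Enum):
    OK = "ok"
    QUESTIONABLE = "questionable"


def check_flowering(start_day, stop_day, harvest_day):
    """Flag the measurement set unless start < stop < harvest holds."""
    in_order = start_day < stop_day < harvest_day
    return Flag.OK if in_order else Flag.QUESTIONABLE


# Hypothetical day numbers, for illustration only.
print(check_flowering(30, 45, 60))  # days are in the expected order
print(check_flowering(30, 25, 60))  # stop day precedes start day
```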

Creation of a Coherent Data Set from an Integrated Data Set

[0162] Data that pass quality control are transformed into coherent data sets. One goal of a coherent data set is to directly compare data of different types recorded in different measurement scales. When a coherent data set is created, the same analysis methods can be used on any subset of the coherent data set. In one embodiment of the present invention, a coherent data set is created from the Arabidopsis herbicide SOA experimental data (SOA1) in the following way:

[0163] 1. Each data point is expressed as a numeric measurement. In the case of a descriptor (such as “Brown leaf color”), the number or frequency of such observations can be recorded. In other cases, one could record the severity of an observation, such as rating the lesions on a leaf on a scale of 0 (no lesions) to 10 (completely covered with lesions).

[0164] 2. Each measurement type (e.g. leaf count or stem length) is transformed so that it approximately follows a Gaussian distribution.

[0165] 3. Each data point is standardized to an appropriate control measurement, and expressed as a number of standard deviations above or below the control, or baseline, mean.

[0166] 4. Optionally, the data are further summarized (such as taking a weighted average of several measurements) to reduce the dimensionality of the data set.

[0167] The above steps 1-4 are followed for each measurement type in the data set. When the steps are completed, all the measurements have the same distribution, and all are expressed in the same units, for example, standard deviations above or below a control mean.
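
Steps 2 through 4 can be sketched as follows, assuming the sample standard deviation of the transformed control values is used as the unit. The function names and the numbers in the usage lines are hypothetical and are not taken from the SOA1 data.

```python
import math
from statistics import mean, stdev


def to_control_units(values, control, transform=lambda x: x):
    """Steps 2 and 3: transform, then express each value in control SDs."""
    transformed_control = [transform(v) for v in control]
    mu = mean(transformed_control)
    sigma = stdev(transformed_control)  # sample standard deviation (assumption)
    return [(transform(v) - mu) / sigma for v in values]


def weighted_summary(scores, weights):
    """Step 4 (optional): weighted average of several standardized scores."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical measurements that need a square root transformation.
control = [16.0, 25.0, 36.0]   # square roots: 4, 5, 6
treated = [9.0]                # square root: 3
scores = to_control_units(treated, control, transform=math.sqrt)
print(scores)  # about two standard deviations below the control mean
```

A measurement type that requires a log transformation could pass `transform=math.log` instead; once every type is expressed in control standard deviations, scores from different technologies can be compared or averaged directly.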

Deriving Coherent Information from Experimental Data

[0168] The maximum rosette radius is recorded for each plant in a phenomics flat. Analysis has shown that maximum rosette radius is not normally distributed, so a square root transformation is used to achieve approximate normality. The average square root rosette radius is then normalized to a comparable control value to obtain a normalized value of −2.84, indicating that the square root rosette radius is 2.84 standard deviations below the control mean. When the same process is performed for a biochemical compound reading, such as lysine, which requires a log-transformation, a normalized value of 3.22 is obtained. In this particular case, rosette radius is significantly smaller, and lysine production significantly larger, when compared to control plants.

Correlation Analysis of Coherent Information and Hypothesis of Gene Function for Glabrous Gene

[0169] Coherent information is analyzed in a variety of ways. Statistical analyses that are widely used include cluster analysis, discriminant analysis, principal components analysis, correlation analysis, and factor analysis. Broadly, the purpose of statistical analyses is to find patterns of similarity and difference in the coherent data sets. One purpose of the analyses is to gather information about how perturbations (genetic, chemical, etc.) of an organism affect the total phenotype (gene expression, biochemical expression, morphometric expression) of the organism. For example, correlation analysis shows that when a particular Arabidopsis gene (called “glabrous”) is inactivated, the resulting plant will have no trichomes, or plant hairs. The absence of plant hairs indicates that the glabrous gene is involved in trichome production. Further experimentation revealed that glabrous is a transcription factor that acts as a “switch” which turns on or off the gene that is directly responsible for forming the cellular structure of trichomes. Thus, a useful correlation is established between the phenotype (no plant hairs) and the disruption of glabrous, the transcription factor that controls the gene responsible for the formation of trichomes.

Principal Components Analysis of Coherent Information and Hypothesis of Gene Function for Herbicidal Action

[0170] Principal components analysis of the herbicide SOA data (SOA1) shows that the application of a herbicide that accepts electrons from a photosystem I (PSI) inhibitor is linked to several observable effects: differential regulation of a suite of genes (GEA data), differential expression of a collection of biochemicals (metabolite analysis), and a specific observed phenotype. Data gathered from observable traits enable the hypothesis that particular genes cause particular chemical changes to bring about particular phenotypic behavior. The SOA1 data are discussed in more detail in Specific Example 1, infra.

Verifying Hypothesis of Gene Function and Designing New Experiments: PSI Inhibitor

[0171] A hypothesis of gene function is limited by the assumptions relied upon in forming the hypothesis. An unverified or untested hypothesis is nothing more than an educated guess about what a gene does. A variety of “wet bench” (laboratory) and bioinformatic experiments can be used to prove or disprove hypotheses. Principal components analysis suggests that a particular herbicide induces reactions similar to those of a PSI inhibitor. A laboratory experiment performed directly on the herbicide in solution demonstrates that the herbicide is not a PSI inhibitor, thereby disproving the initial hypothesis of herbicide function. FIG. 10 illustrates one embodiment of the methods of the present invention as applied to, for example, the experimental data from SOA1 (Specific Example 2, infra).

Verifying Hypothesis of Gene Function and Designing New Experiments: Transcription Factor

[0172] When the original connection between the glabrous gene and trichome production was observed, a number of hypotheses were suggested. One hypothesis was that glabrous might be directly responsible for trichome production. A second hypothesis was that glabrous might be a transcription factor for another gene that is directly responsible for trichome production. A third hypothesis was that glabrous and the directly responsible gene might both be regulated by a third gene. Bioinformatic analysis shows that glabrous has a structure similar to other transcription factors, and wet bench experiments show that regulating glabrous affects another gene but not vice versa. Finally, it can be demonstrated that glabrous binds to a specific protein. A review of the evidence resulted in a conclusion that glabrous is a transcription factor for the gene that causes trichome production.

Integrating Profiling Technologies for Defining Herbicidal Site of Action

[0173] Herbicide development has traditionally involved multiple rounds of spray trials to identify and refine lead compounds, accompanied by lengthy biochemical experiments in a search for the site of action. The convergence of multiple technologies has positioned the agrochemical discovery and development process for potentially dramatic change. One change is the transition from whole organism testing to the use of mechanistic in vitro assays for primary screening. Transitioning to in vitro assays has been driven, in part, by the emergence of combinatorial chemistry, a methodology capable of generating vast chemical libraries containing small quantities of each chemical. In vitro assays are more amenable to high or ultra high throughput screening and miniaturization than whole organism testing, and the latter has been relegated to later stages of the herbicide development process. Whole organism testing as an initial screen is also less desirable in light of the waning number of new targets found by this approach despite screening with increasing numbers of compounds. Interestingly, whole organism testing has led to the discovery of only 20 distinct sites of action for all herbicides in the past 60 years, while estimates of potential herbicide targets exceed this number by two orders of magnitude. Ward & Bernasconi, 17 NATURE BIOTECH. 618-19 (1999). Thus, despite the fact that all potential target sites are available when screening with whole organisms, only a fraction of the potential herbicide targets have been identified and exploited.

[0174] The advent of complete sequence information for the model plant system Arabidopsis has enabled a systematic exploration of gene function that directly complements herbicide discovery via in vitro assays. Efforts to increase and decrease the expression of every gene in Arabidopsis by molecular genetic manipulations are underway. Phenotypes of the corresponding mutants are being systematically profiled in both public and private efforts. In this way, all potential herbicide targets can be identified and the most promising chosen for a screening program using in vitro assays.

[0175] A number of genomic technologies have been developed to capture the molecular details of genetically altered or treated tissue. Genomic technologies include profiling changes at the transcript, protein, and metabolite levels. Previous investigators have validated the approach of creating a compendium of transcriptional profiles to facilitate the identification of the site of action or mode of action of an unknown compound. Profiles of known mutants were compared to profiles of unknown mutants, and where a reasonable similarity occurred, it was determined that the unknowns had a common site of action/mode of action (SOA/MOA). Generation of a database of profiles corresponding to all putative herbicide targets would be an extremely valuable resource for development of new herbicides. Currently there are many herbicides where the site of action and/or the mode of action are not known, but could be rapidly determined using a compendium approach.
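
The compendium comparison can be illustrated with a small sketch. The similarity measure (Pearson correlation), the threshold of 0.8, the function names, and the profile values are all assumptions chosen for illustration; the text does not specify how "reasonable similarity" is measured.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def best_match(unknown, compendium, threshold=0.8):
    """Return (label, r) for the most similar known profile.

    The label is None when no known profile reaches the threshold.
    """
    label, r = max(
        ((name, pearson(unknown, profile)) for name, profile in compendium.items()),
        key=lambda item: item[1],
    )
    return (label, r) if r >= threshold else (None, r)


# Hypothetical standardized profiles, one value per measurement type.
compendium = {
    "SOA_A": [2.0, -1.0, 0.5, 3.0],
    "SOA_B": [-2.0, 1.0, 0.0, -3.0],
}
print(best_match([1.0, -0.5, 0.25, 1.5], compendium))  # resembles SOA_A
print(best_match([1.0, 1.0, -1.0, -1.0], compendium))  # no confident match
```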

[0176] Herbicides developed via an in vitro system must be plant-tested, and the molecular details of the plant response need to be defined. Herbicides developed against a target in vitro may preferentially inactivate a different site in vivo or may target multiple sites. Insight into these details is essential for responsible product stewardship in an intense regulatory environment. One purpose of the Arabidopsis herbicide SOA study was to evaluate phenotypic, transcriptional, and metabolic analysis technologies for building a compendium database to determine herbicide SOA/MOA. A collection of herbicide-treated tissue, forming a test set, was used to generate data from three different technology types. Data were evaluated for accuracy in grouping the herbicides into target classes. Determining the site of action of herbicides has traditionally been an involved and lengthy process requiring extensive biochemical studies. Described herein are methods for utilizing phenotypic, transcriptional, and metabolite analysis technologies that accurately grouped a set of 18 herbicides into nine distinct sites of action. It is important to note that using data obtained from only one or two of the technology types resulted in false groupings. The results suggest that a comprehensive database of integrated, coherent data derived from tissue systematically treated with specific chemical inhibitors enables the prediction of the site of action of virtually any herbicide.

Integrating Profiling Technologies for Defining a Human Disease State

[0177] Methods and systems of the present invention provide for the diagnosis and treatment of human diseases, such as diabetes mellitus. Diabetes mellitus (DM) is a disorder characterized by chronic hyperglycemia, and diabetes symptoms include altered carbohydrate, fat, and protein metabolism. Diabetes is a complex disease of multiple etiology, which complicates treatment and increases the risk of misdiagnosis. In many cases, a collective view of test results is required for even a non-exacting diagnosis, and the data from no single test is inherently diagnostic, nor are singular test results readily able to posit causality, explain anomalies, or direct further research or testing. Data can be and have been generated through a variety of approaches, but within a technology only gross fluctuations may be evident or capable of correlation and association with DM. An expanded view across integrated data streams can increase the benefits of current test results through furthering interpretive capacity, as well as furthering opportunities to establish correlations by increasing levels of experimental range, resolution, and accuracy. Coherence may, in part, have already been established through the standardization of methods for obtaining data, and analysis may further refine methods for obtaining data. As coherence is more definitively established in the data, diagnostic capacity should increase, and patterns or profiles, not only for the gross disease but also for individual variants within the disease, should begin to emerge.

[0178] The present invention provides methods and systems for the use of coherent data sets in studies of DM, and other human and animal diseases. A murine model system contains data streams generated via six different technologies: genotype/sequence data, gene expression data (GEA), metabolite analysis, phenotypic analysis data, SNP data, and proteomics data. Data from each technology type can be collected; subjected to quality control; integrated with data from the other technology types; and analyzed into increasing degrees of coherence.

[0179] A hereditary link has already been established for diabetes mellitus, but it is a complex disease with both genetic and environmental components. Davies et al., 371 NATURE 130-136 (1994). Some regions of the genome have been established as indicators of risk of DM, but are not wholly diagnostic. Hashimoto et al., 371 NATURE 161-164 (1994). In many cases, genetic factors are not clearly evidenced for all forms of the disease. To narrow down and understand the genetic alterations relevant to DM, additional specific information is needed with respect to the genetic lesions an individual carries, as well as coherent links to more specific information about patient health (gross phenotype), gene expression, protein expression, and metabolite analysis. Coherent links are particularly instructive for establishing possible causative factors in cases where a hereditary link is not clear. Although the use of human genotypic data is desirable, a mouse model system provides greater initial comparability through the controlled nature of gene knock-out and knock-in experiments, and provides a foundation upon which to build heterogeneous human genetic data. Knock-out murine models have been reported in the literature as a model for the study of DM, specifically with an Akt2 gene knock-out. Cho et al., 292 SCIENCE 1728-1731 (2001).

[0180] A controlled genetic system also provides for comparable phenotypic data. Comparable phenotypic data refers primarily to gross phenotypes with potentially diverse individualized measurements, as compared to the molecular phenotypes (often of limited range) and aspects of measurements from other technologies (such as genotype, gene expression analysis, metabolite analysis, SNP analysis, and proteomics). In mice, phenotypic data can extend many levels beyond those available with humans, allowing analysis of organ architecture and age-related profiles. Even with humans, however, the expansion of phenotypic data beyond the limited range currently known to have diagnostic potential could lead to an improved understanding and establishment of relevant correlations when placed within a set of coherent data. Qualitative and quantitative data are used as criteria for diagnosing diabetes, such as, for example, increased thirst, increased urine production, blurred vision, and blood sugar levels, but are not always diagnostic. New phenotypic data could be measured and those already measured could be made more exacting. A similar approach has been reported using a plant model. Boyes et al., 13 PLANT CELL 1499-1510 (2001). Linkage of phenotypic data to coherent data sets could ultimately provide earlier, more exacting and reliable diagnoses of DM. Winkelmann, 2 PHARMACOGENOMICS 11-24 (2001).

[0181] Gene expression analysis (GEA) provides a quantitative measure of individual gene expression as reflected in cellular RNA content for various mRNAs and alternative mRNA forms. A number of studies of gene expression have been performed to look at changes associated with DM. For example, GEA data have been used to observe differences in the expression of glutaminase and glutamine synthetase and of tissue-specific glutaminase and glutamine synthetase transcripts in DM. Labow et al., 131 J. NUTRITION 2467S-2474S (2001). Independent of other data, such as levels of the metabolite glutamine, or expression of the proteins coded for by the mRNAs, conclusions based upon glutaminase and glutamine synthetase data are limited in a way that is overcome by inclusion of the data in a coherent data set. Similarly, a range of gross and molecular phenotypes is traceable to mutation in a single transcription factor, as in, for example, MODY, and is most easily identified by a GEA profile when the data are properly interlinked and available for analysis in a coherent data set. Owen & Hattersley, 15 BEST PRAC. RES. CLIN. ENDOCRINOL. METAB. 309-323 (2001).

[0182] Proteomics, in the context of the present invention, is understood as data largely produced through two-dimensional gel electrophoresis to identify the presence and patterns of cellular protein expression and modification. In this respect, it is quite analogous to GEA data. Some forms of DM show specific alterations in protein expression and modification, most obviously in the expression and modification of insulin. Insulin is initially produced as the peptide preproinsulin. A portion of the peptide is then cleaved off to produce proinsulin in the lumen of a cell's rough endoplasmic reticulum. Within secretory granules of a pancreatic beta cell, proinsulin is then cleaved to form the final alpha and beta chains of insulin, plus the “connecting” peptide. Misexpression of insulin precursors and the final form of the insulin protein may indicate a critical defect causative of diabetes, and one that might be correlated with, for example, mutations in the gene sequence (genotype data), or altered expression of relevant proteases (GEA data), if combined with the methods and systems of the present invention to create coherent data sets. Likewise, previously unidentified protein alterations might be discovered by correlation with data from other technologies in a coherent data set.

[0183] Metabolite analysis is particularly useful in the study of DM, since DM is a metabolic disorder. Individual metabolites present in cells are identified and/or measured, establishing the presence, quantities, patterns, and modifications of small biomolecules, often the substrates and products of enzymatic reactions. Uniting genotype, GEA, proteomics, and metabolite analytical data provides a deep and interconnected window to the molecular/cellular level to correlate with intercellular and gross phenotype data. DM is a metabolic disorder with a failure of cellular uptake of glucose and a consequent altering of protein and fat metabolism, and these changes are detected using metabolite analysis technologies. Increased fat metabolism can lead to ketoacidosis, but as with the other technologies, absent contraindication, metabolite analysis data reflecting ketoacidosis can lead to misdiagnosis, in this case as hyperventilation syndrome. Treasure et al., 294 BR. MED. J. (Clin. Res. Ed.) 630 (1987).

[0184] Establishing coherent data sets created from data streams of different research technologies and manipulating and analyzing the data by computer-based methods and systems allows emergence of new connections, correlations, and understanding of gene function, which results in new and improved tools and treatments for managing disease. Ultimately, coherent data sets improve diagnosis and monitoring by providing exacting profiles of genetic, metabolic, and gene and protein expression alterations that correspond to disease states, independent of postulating rules, higher order structures, or causation. In a complex disease like DM, coherent data sets also allow a very exacting reclassification of subtypes of the disease based on the different signature profiles that lead to the disease state. Signature profiles in a computer database of high coherence (comparability) will allow for rapid and clear diagnosis when used to match patient data with signature profiles for disease. Identification of co-heritable diseases that might otherwise be masked, such as coeliac disease with Type 1 diabetes, is greatly simplified through establishing clear signature profiles and profile subtypes. Laloux et al., 13 DIABETES METAB. 520-528 (1987). Disease diagnosis is dynamic, requiring monitoring and re-evaluation. By monitoring a patient from one diagnostic state to another, coherent data sets are produced for the changes that occur as a disease either progresses or improves, permitting enhanced predictive and preventive measures, and increasing the chances of stabilizing a condition.

[0185] By postulating causative agents and critical targets from the analysis of specific profiles, treatment is individualized, and specific targets are provided for high throughput efforts of drug discovery. Monitoring changes in a signature profile over a course of treatment will make clear whether a drug is directly affecting the molecular phenotypes/symptoms, permitting drug validation, as well as making clear undesirable secondary effects that will be further monitored in attempts to optimize the drug design and dosage. Methods of the present invention can result in coherent data sets that provide rational, and thus less costly, drug screening, as well as rational and validated design and product improvement.

Correlation of Data with Biochemical Pathway Information

[0186] Another aspect of the present invention is to provide comprehensive methods and systems for linking metabolites in cells, biofluids, and tissues, to biochemical reactions, pathways, and pathway networks. It is generally accepted that a metabolic response of living organisms is altered by genetic makeup (or change), disease state, chemical exposure (including therapeutic treatment) or environmental insult. Thus, the methods of the present invention are particularly useful for understanding the relationship between biochemical response and disease or phenotypic association.

[0187] The methods and systems of the present invention are useful for linking a particular metabolite or enzyme with all associated biochemical reactions and/or pathways. Existing metabolic databases such as KEGG (Kyoto Encyclopedia of Genes and Genomes, Institute for Chemical Research, Kyoto University, Japan), BRENDA (Institute of Biochemistry, University of Cologne, Germany), and EMP (Enzymes and Metabolic Pathways, EMP, Inc., New York, N.Y.) are large, but error prone. Furthermore, the above databases do not represent the complex network of metabolism in a manner that allows for retrieval of an accurate, comprehensive list of the metabolic linkages. For example, BRENDA contains information on genes with associated reactions, but fails to provide linkages to the corresponding biochemical pathways. While KEGG provides pathway information, the pathways are stored as unordered collections of catalyzed reactions. In addition to the lack of order in the pathways, KEGG consists of a generic listing of multiple species, rendering accurate retrieval of human metabolic data impossible. In contrast, the current invention provides methods and systems for obtaining the linkage of any metabolite or enzyme, in a particular cell, biofluid, or tissue, with all associated biochemical reactions and/or pathways, and/or disease, and/or phenotype associations.

[0188] In one embodiment of the present invention, methods and systems are provided for linking a complete spectrum of metabolites in a cell, biofluid, or tissue, from an organism to biochemical reactions and pathways, and correlating the biochemical reactions and/or pathways to a phenotype of the organism. In this manner the methods of the invention are useful for correlating a biochemical profile with a disease state. The methods and systems of the invention provide for linking a complete spectrum of metabolites in a cell, biofluid, or tissue, from a diseased or treated organism to biochemical reactions and pathways, and correlating the biochemical reactions and/or pathways to a site of action of a disease or therapeutic modality. In this manner the methods and systems of the invention are used for discovering or validating that a therapeutic affects a target biochemical reaction and/or pathway. The methods and systems of the present invention are also useful for monitoring the disease stage of an organism, diagnosing an organism with a particular disease, and monitoring the efficacy of a therapeutic on an organism, such as the yeast azole drug experiment discussed in Specific Example 5, infra.

[0189] In other aspects, the present invention provides methods and systems for computing all possible biochemical pathways that link a first metabolite to a second metabolite; compiling all possible compounds that result from the biosynthesis or degradation of a particular metabolite; identifying all possible biochemical reactions and/or pathways in which a particular enzyme is involved; and identifying all possible biochemical reactions and/or pathways in which a particular metabolite is involved.
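
As a rough illustration of two of these queries, the sketch below treats metabolism as a directed graph and enumerates pathways by depth-first search. The data layout, the function names, and the toy network (metabolites A to D, enzymes E1 to E5) are hypothetical, and the sketch restricts itself to pathways that never revisit a metabolite, which is an assumption made here so that cycles in the network do not produce endless pathways.

```python
def all_pathways(reactions, start, goal):
    """Enumerate every pathway from start to goal that visits no metabolite twice.

    reactions maps a substrate to a list of (enzyme, product) pairs.
    Each pathway is returned as a list of (enzyme, product) steps.
    """
    paths = []

    def walk(node, visited, path):
        if node == goal:
            paths.append(path)
            return
        for enzyme, product in reactions.get(node, []):
            if product not in visited:
                walk(product, visited | {product}, path + [(enzyme, product)])

    walk(start, {start}, [])
    return paths


def reactions_for_enzyme(reactions, enzyme):
    """List the (substrate, product) pairs catalyzed by the given enzyme."""
    return [
        (substrate, product)
        for substrate, steps in reactions.items()
        for step_enzyme, product in steps
        if step_enzyme == enzyme
    ]


# Hypothetical toy network containing a cycle (D back to A).
reactions = {
    "A": [("E1", "B"), ("E2", "C")],
    "B": [("E3", "D")],
    "C": [("E4", "D")],
    "D": [("E5", "A")],
}
for pathway in all_pathways(reactions, "A", "D"):
    print(pathway)
```

Exhaustive enumeration of this kind grows quickly with network size, so a practical system would likely bound pathway length or prune by organism and tissue.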

[0190] The methods and systems of the present invention encompass the development and use of a database of endogenous metabolites, inclusive of the metabolites found in different organisms and the biochemical reactions in which those metabolites are involved. The database of endogenous metabolites is useful in correlating disease states, phenotypes, and metabolites. Data from the database of endogenous metabolites can be incorporated into coherent data sets, ultimately allowing linkage of any coherent data set data, such as gene expression data, to disease states and phenotypes. Included in the methods and systems of the present invention are comprehensive and quantitative analyses of low molecular weight biochemicals revealing a metabolome. The metabolome is best described by analogy to the genome, i.e. where the human genome is the set of all genes in a human, the human metabolome is the set of all endogenous metabolites in a human. The science of genomics is based upon a genome and the science of metabolomics is based upon a metabolome. To continue the genome/metabolome analogy, any published human genomic sequence is a statistical approximation, as it is derived from a limited number of individuals, and any individual necessarily has a unique genome. Similarly, the human metabolome is a statistical approximation of the total human metabolic potential. Furthermore, just as the human genome is differentiable from other genomes, for instance, the Xenopus or Caenothus genomes, the human metabolome that defines the human biochemical potential is differentiable from other metabolomes.

[0191] The database of endogenous metabolites is a comprehensive set of all potential metabolites, or chemical components, which can be found in the cells, biofluids, or tissues of any individual under all conditions. It is likely that most individuals vary in their biochemical potential, expressing only incomplete subsets of the metabolome, depending on their genetic makeup, environmental conditions, and state of health. Indeed, many metabolic diseases and even the efficacy of most drugs are variable, due, at least in part, to individual variances in metabolism and the resulting biochemistry.

[0192] The metabolome of an organism is the total set of all endogenous metabolites found in the organism. The metabolite, or biochemical, profile of a biological sample is a list of any endogenous metabolites detected in the sample, together with a measure of how far each metabolite varies from its baseline value. Experiments show that the biochemical profile of a mouse heart (FIG. 11A) is different from the biochemical profile of a mouse kidney (FIG. 11B). By monitoring biochemical or endogenous metabolite profiles, one can diagnose disease, identify the stage of the disease, offer a prognosis, and suggest a treatment. Further, a treated individual can be monitored throughout the course of a disease, tracking the stages of the disease as treatment is applied to ensure that the treatment received remains efficacious. Treatment can be adjusted according to results obtained from metabolite analysis.
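As a non-limiting sketch, a biochemical profile of the kind defined above may be represented as a mapping from each detected metabolite to its deviation from baseline. The use of a z-score (difference from the baseline mean in units of baseline standard deviation) is an assumption chosen for illustration; the specification does not state which measure of deviation is used, and all names and values below are hypothetical.

```python
from statistics import mean, stdev


def biochemical_profile(sample, baseline_replicates):
    """Return {metabolite: z-score} for metabolites that have a usable baseline.

    sample: {metabolite: measured value} for one biological sample.
    baseline_replicates: {metabolite: [baseline values]} from control samples.
    """
    profile = {}
    for metabolite, value in sample.items():
        baseline = baseline_replicates.get(metabolite, [])
        if len(baseline) < 2:
            continue  # no variance estimate is possible
        spread = stdev(baseline)
        if spread == 0:
            continue  # avoid division by zero for a constant baseline
        profile[metabolite] = (value - mean(baseline)) / spread
    return profile


# Hypothetical measurements in arbitrary units.
profile = biochemical_profile(
    {"metabolite_1": 13.0, "metabolite_2": 5.0},
    {"metabolite_1": [9.0, 10.0, 11.0]},
)
```

Metabolites without at least two baseline measurements are omitted rather than assigned an arbitrary score.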

[0193] Metabolite analysis is particularly applicable to problems in which physiology is altered, e.g. through stress, disease, chemical, or other insult. Roessner et al., 13 PLANT CELL 11-29 (2001); Glassbrook et al., 18 NATURE BIOTECH. 1142-1143 (2000). Similar to transcriptomics and proteomics, metabolomics takes a global view of an organism, i.e. it attempts to understand the current physiological status of a sample or organism in light of its full physiologic potential. Metabolomics information can be combined with data from other biological indicators in a coherent data set.

[0194] Unlike transcriptional or proteomic analysis, biochemical analysis directly reflects physiological status. Whereas the nature and relationship of almost all metabolomic entities (i.e. biochemicals) have been thoroughly established through decades of biochemical investigations, the vast majority of genes, transcripts, and/or proteins are only partially characterized; the functional significance thereof is often largely hypothetical, if understood at all. The application of metabolomics characterizes the physiological state of a sample by determining the actual or relative concentration of the entire set of small molecules that constitute metabolism. The establishment of a database of endogenous metabolites will enhance the application of metabolomics.

[0195] For the purpose of this invention, the database of endogenous metabolites consists of the native small molecules (e.g. non-polymeric compounds) involved in metabolic reactions required for the maintenance, growth, and function of a cell. The following implications flow from this definition:

[0196] 1. Enzymes, other proteins, and most peptides are generally not small molecules and are thus excluded. Many proteins participate in biochemical reactions with small molecules (e.g. isoprenylation, glycosylation, and the like). The construction and degradation of polypeptides results in either the consumption or generation of small molecules and, thus, the small molecules rather than the proteins make up the metabolome.

[0197] 2. Genetic material (all forms of DNA and RNA) is also excluded from the metabolome based on size and function. The construction and degradation of polynucleotides results in either the consumption or generation of small molecules and, thus, the small molecules rather than the polynucleotides are part of the metabolome.

[0198] 3. Structural molecules (e.g. glycosaminoglycans and other polymeric units) similarly may be constructed of and/or degraded to small molecules, but do not otherwise participate in metabolic reactions. Thus, structural molecules are excluded from the metabolome.

[0199] 4. Polymeric compounds such as glycogen are important participants in metabolic reactions, but are not chemically definable and are instead a source of metabolites (i.e. an input/output to metabolism). Thus, polymeric compounds are excluded from the metabolome.

[0200] 5. Metabolites of xenobiotics are neither native nor required for the maintenance, growth, or normal function of a cell, and thus are not part of the metabolome. However, it is useful to monitor xenobiotics when observing the effects of a drug therapy program, or in experimentally determining the effects of a compound on an individual.

[0201] 6. Essential or nutritionally required compounds are not synthesized de novo (i.e. not native), but are required for the maintenance, growth, or normal function of a cell. Therefore, essential or nutritionally required compounds are part of the metabolome.

[0202] The foregoing definition of the database of endogenous metabolites emphasizes the focus of one embodiment of the present invention with respect to metabolism and physiology. As a matter of historical precedent, the term “metabolite” is often interpreted to cover only the subset of metabolites that are part of degradation pathways. However, in the instant case, the terms “biochemical” and “metabolite” are viewed as congruent terms and used interchangeably. Similar congruence is intended for the terms “biochemical profiling,” “metabolite profiling,” and “metabolic profiling.” The foregoing definition is not meant to be limiting in the sense of metabolites only as part of degradation pathways, but rather the intention of the term “metabolite” is the broadest possible definition of a biochemical involved in metabolism inclusive of catabolism.

[0203] The present invention encompasses methods and systems for establishing a database of endogenous metabolites. Construction of metabolic networks in microbes has been accomplished previously. Selkov, 3 PROC. INT. CONF. INTELL. SYST. MOL. BIOL. 127-135 (1995). In the present invention, and as shown in FIG. 3, the database of endogenous metabolites is constructed using a combination of mining existing databases and literature sources for known metabolites having associated reactions and/or pathways and characterizing and/or identifying metabolites present in experimentally derived chromatograms. The present invention provides methods and systems for creating a database of endogenous metabolites that provides information about biochemical pathway designation and disease and/or phenotype association for compounds of interest, and provides data useful in the formation of coherent data sets. Selkov et al., 28 PROC. NAT'L. ACAD. SCI. U.S.A. 3509-3514 (2000); Covert et al., 26 TRENDS BIOCHEM. SCI. 179-186 (2001). When required, biochemical standards are obtained so that the database of endogenous metabolites is based on empirical data. In this manner, an accurate and comprehensive representation of biochemical potential is obtained.

[0204] For example, to generate and build a database of endogenous metabolites, a genome of an organism of interest is mined for all genes annotated as enzymes. The organisms of interest include animalia, plantae, protista, monera, and fungi. More specifically, the organisms of interest include, but are not limited to, human and non-human primates, canines, felines, equines, bovines, porcines, rabbits, rodents, Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora, Penicillium, Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas. A preliminary list of enzymes is qualified to ascertain that the enzymes are all generally accepted in the art as being involved in the metabolism of the organism of interest. The qualified enzymes are used to generate a preliminary list of associated reactions by reference to existing metabolic databases. Biochemical and metabolic linkage information is entered into a database, and additional reactions in which the preliminary metabolites are known to participate are characterized and/or identified. The sequence of the enzymes involved in the newly identified reactions is obtained from the genome of the organism of interest. The foregoing steps are reiterated until as much metabolic information as possible is uncovered and retained. At the point of sufficient understanding of the framework of the metabolism of an organism of interest, whole pathways are deduced from the existing collection of metabolic reactions. The enzymes involved in the newly implicated pathways become a source of additional information, and the steps are repeated as described.
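The iterative procedure described above resembles a fixed-point expansion: starting from the qualified enzymes, reactions and metabolites are added until a pass over the reference data uncovers nothing new. The following Python sketch illustrates that loop under simplifying assumptions. The reference table, its identifiers, and the function name are hypothetical, and the enzyme qualification and genome sequence look-up steps are omitted.

```python
# Hypothetical reference database: reaction ID -> catalysing enzyme and metabolites.
REFERENCE = {
    "R1": {"enzyme": "E1", "metabolites": {"a", "b"}},
    "R2": {"enzyme": "E2", "metabolites": {"b", "c"}},
    "R3": {"enzyme": "E3", "metabolites": {"c", "d"}},
    "R4": {"enzyme": "E4", "metabolites": {"x", "y"}},  # not connected to the seed
}


def expand_network(seed_enzymes, reference):
    """Grow the sets of enzymes, reactions, and metabolites until nothing new is found."""
    enzymes = set(seed_enzymes)
    reactions = set()
    metabolites = set()
    changed = True
    while changed:
        changed = False
        for reaction_id, entry in reference.items():
            if reaction_id in reactions:
                continue
            # A reaction is added if a known enzyme catalyses it or if it
            # involves a metabolite that is already in the network.
            if entry["enzyme"] in enzymes or entry["metabolites"] & metabolites:
                reactions.add(reaction_id)
                enzymes.add(entry["enzyme"])
                metabolites |= entry["metabolites"]
                changed = True
    return enzymes, reactions, metabolites


enzymes, reactions, metabolites = expand_network({"E1"}, REFERENCE)
```

In this toy case the expansion reaches reactions R1 through R3 from the single seed enzyme, while R4 stays outside the network because it shares neither an enzyme nor a metabolite with it.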

[0205] To obtain a comprehensive metabolite database, additional methods are used to complete pathways and identify peripheral pathways. One such method is curating biochemicals and associated reactions/pathways based on available literature. Another method is characterizing and/or identifying biochemicals in experimentally derived chromatograms. A benefit of the reaction-based approach of the current invention is that all of the metabolites in the metabolome are associated with one or more enzymes, and fit into known biosynthetic relationships. Previously proposed approaches based completely on chemistry suffer from the drawback of being limited to lists of disjointed compounds.

[0206] One aspect of the present invention is to provide a database of endogenous metabolites suitable for use with human conditions. Preliminary estimates of the total number of compounds in a human are varied. The standard wall-chart of metabolism, which includes reactions not present in humans, lists only about 800 compounds in core primary metabolism. Most biochemical textbooks extend this list to no more than 1200 to 1500 compounds, again drawing from all life forms. Extensive querying of publicly available databases for human metabolites enables extension of the list to approximately 2000 compounds. Even assuming the final number of compounds in the human metabolome to be between 3000 and 4000, the size of the metabolome is workable and forms a firm foundation for scientific discovery.

[0207] The methods and systems used in the present invention to characterize and/or identify biochemicals are based on spectroscopic, or spectral analysis, procedures. Spectroscopic methods have been utilized for decades for the detection of biochemicals. Conventionally, biochemicals were separated based on chemical properties. The types of biochemicals under investigation dictate the detection methods employed (e.g., electrochemical, ultraviolet (UV), nuclear magnetic resonance (NMR), mass spectrometry (MS)). With decades of improvements in instrument hardware and computer systems, greater sensitivity and resolution have been achieved for simultaneous detection of a broad range of biochemicals.

[0208] The methods and systems of the present invention encompass, for example, use of Nuclear Magnetic Resonance (NMR) spectroscopy and Mass Spectrometry (MS), two of the most commonly used techniques for the detection of biochemicals. NMR spectroscopy has been applied to develop unique patterns for chemical-induced toxicity, and for determining biomarkers associated with specific disease states. Most of these studies have focused on analysis of metabolites in biofluids. With high field strength magnets (500 MHz and up), NMR data can be acquired on a broad range of metabolites without the requirement of chromatographic separation. In cases of spectral overlap, multidimensional NMR methods can be used to resolve metabolite profiles. Hyphenated NMR methods (such as liquid chromatography-NMR) have also been used when metabolite separation is necessary. NMR methods are also used for detection of metabolites directly in tissue (using magic angle spinning techniques), and tissue metabolites are measured via NMR following extraction methods that are typically employed with such technologies and are known by those skilled in the art.

[0209] The following techniques are also used in the present invention for the characterization and/or identification of biochemicals. Mass Spectrometry (MS) is the most common technique employed for metabolomic studies, and has an advantage over other technologies (e.g., NMR) in providing greater sensitivity and resolution. As with NMR, hyphenated techniques are often employed in the MS analysis, including front-end gas chromatography (GC) or liquid chromatography (LC) methods. A variety of MS techniques must be employed to characterize and/or identify and cover the wide range of chemical classes that occur in biofluids, tissues, and cells. Aspects of MS techniques may include, but are not limited to, time-of-flight, Fourier transform, ion traps, and quadrupoles, using a variety of ionization methods (e.g., electrospray ionization, chemical ionization, and the like). With a specific combination of MS detector type and ionization method, a highly sensitive, high-resolution method is obtained, allowing for simultaneous measurement of the comprehensive set of biochemicals comprising the metabolome. Hyphenated detection systems, such as MS-MS, also result in increased resolution of chemical components.

[0210] In the case of the current invention, as for all technologies that result in the measurement of a broad range of components, a major challenge is in data extraction and correlation with biological significance. To effectively manage and utilize the vast amount of data generated to create the human metabolome, informatics software and tools for representing and analyzing data are developed. Complex computational methods are essential for organizing data, analyzing large-scale data sets, generating new hypotheses, and deriving useful information from collected data. These techniques have been successfully demonstrated in the area of gene expression and are applied to metabolomics data with few modifications. To date, most published data analysis methods are based on clustering, principal component analysis, partial least squares, and analysis of variance. However, caution is taken to meet the statistical requirements for such tests and to avoid misinterpretations. Bioinformatics tools are available for manipulating complex data sets; however, more advanced tools specifically designed for metabolomics data are provided in the current invention to link specific metabolites with cells and tissues within an organism.

SPECIFIC EXAMPLE 1 Preparation of a Database of Endogenous Metabolites for Arabidopsis thaliana

[0211] To generate a database of metabolites, a list of potentially detectable plant compounds for each analysis methodology was created using the known function and metabolic pathways of the plant tissue to be studied. In addition, spectral peaks routinely observed in the plant samples were catalogued in the database. In some cases, data corresponding to the spectral peaks without a confirmed identity indicated additional compounds of interest for validation. The process for generating the database of endogenous metabolites was as follows: nominate compounds of interest, obtain the compounds (if possible), prepare and perform metabolite analysis of the compounds and the plant samples, process the spectral data, and add the spectral data and other compound/sample information to the database of endogenous metabolites (FIG. 3).

[0212] In order that the spectral data collected for the compounds in the database of endogenous metabolites accurately reflect the data for the plant samples in the study, the compounds were prepared for metabolite analysis in a manner identical to that for the plant samples in which the compound was expected to be present. The analyses performed were one or more of: LC-MS, GC-MS, ICP-MS, and global assays (e.g. total protein, total carbohydrate, and total fat).

[0213] The spectral data entered into the database of endogenous metabolites included intensity, retention time, mass, and the like. A link was established in the database between the compounds and associated Peak_IDs for the various analysis technologies (LC-MS, GC-MS, ICP-MS, and global assays). In addition, information related to the stability of each compound generated according to the extraction and analysis processes described herein was entered into the database. When available, basic information about the compounds was entered into the database of endogenous metabolites, such as name(s), molecular formula, structure, CAS #, vendors (if commercially available), molecular weight, and the like. Compounds in the database of endogenous metabolites were further described according to one or more of organism, tissue, cell type, treatment, disease state, phenotype, pathway(s), enzymatic reaction(s), and associated enzyme EC #.
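For illustration, the compound and peak information listed above could be held in records such as the following. This is a minimal sketch only: the class names, field names, and the example peak values are assumptions, and the actual database schema is not disclosed in this form.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Peak:
    """One spectral peak; peak_id plays the role of the Peak_ID link."""
    peak_id: str
    technology: str  # e.g. "LC-MS", "GC-MS", "ICP-MS", or "global assay"
    retention_time: float
    mass: float
    intensity: float


@dataclass
class Compound:
    """Descriptive data for one compound plus its linked peaks per technology."""
    name: str
    molecular_formula: Optional[str] = None
    cas_number: Optional[str] = None
    molecular_weight: Optional[float] = None
    pathways: List[str] = field(default_factory=list)
    ec_numbers: List[str] = field(default_factory=list)
    peaks: Dict[str, List[Peak]] = field(default_factory=dict)

    def link_peak(self, peak: Peak) -> None:
        """Associate a peak with this compound under its analysis technology."""
        self.peaks.setdefault(peak.technology, []).append(peak)


# Hypothetical example entry; the peak values are invented for illustration.
sucrose = Compound(name="SUCROSE", molecular_formula="C12H22O11")
sucrose.link_peak(Peak("P0001", "LC-MS", retention_time=2.4, mass=341.1, intensity=1.5e5))
```

Grouping peaks by technology mirrors the statement that each compound is linked to Peak_IDs for the various analysis technologies.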

[0214] Plant Tissue Sample Preparation Procedures

[0215] Minimal sample preparation was performed on plant tissues for metabolite analysis. Arabidopsis tissue (leaves, siliques, seeds) was harvested directly into tared and barcoded tubes (96-well format) in liquid nitrogen using an automated weighing station (Mettler-Toledo Bohdan, Inc., Vernon Hills, Ill.). Samples were lyophilized without being allowed to thaw, mechanically ground to powder, and stored at low humidity (≤10%) until undergoing analysis. In the case of silique samples, polytetrafluoroethylene (PTFE) was added at a ratio of 1:3 (sample:PTFE) to facilitate the grinding and dispensing steps. Similarly, PTFE was added at a ratio of 1:5 (sample:PTFE) to facilitate the grinding and dispensing steps for seed samples.

[0216] For GC-MS, LC-MS, and ICP-MS analysis, the ground plant tissue was dispensed into 96-well plates using a powder dispensing robot which aspirates and dispenses a fixed powder volume of sample (Zinsser Analytic GmbH, Frankfurt, Germany). Sample location in the plate was tracked by linking sample ID with plate ID in the laboratory information management system (LIMS). The weight of the dispensed samples was re-measured and the actual sample mass values were uploaded to LIMS.

[0217] LC-MS Procedures

[0218] Approximately 10 mg of dried ground plant tissue was extracted in 0.5 mL 10% aqueous methanol containing labeled internal standards. Tissue was disrupted by a 30 second pulse of high level sonic energy (lithotripsy) at a maximum temperature of 30° C. The extract was centrifuged at 4000 rpm for 2 minutes. The supernatant, diluted with an equal volume of 50% aqueous acetonitrile (V/V), was chromatographed on C18 HPLC in an acetonitrile/water gradient containing 5 mM ammonium acetate. Samples were passed through a splitter and the split flow was infused to turbo-ionspray ionization sources of two Mariner LC TOF mass spectrometers (PerSeptive Biosystems Inc., Framingham, Mass.). The ionization sources were optimized to generate and monitor positive and negative ions, respectively. The Total Ion Chromatogram (TIC) was analyzed for compounds with masses ranging from 80 to 900 Daltons (Da). The individual ion traces were used for both calibration and quantification. Relative amounts of the compounds were determined using the intensity and peak areas of individual ion traces. Isotopically labeled internal standards were used for peak area ratios, response factor determination, and normalization of data throughout the experiments.
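The internal-standard normalization mentioned above may, for example, be carried out as a peak area ratio optionally corrected by a response factor. The sketch below assumes the conventional definition in which the response factor equals the area ratio divided by the concentration ratio; the function name, argument names, and values are hypothetical, and the exact calculation used in the experiments is not specified.

```python
def relative_amounts(peak_areas, standard_area, response_factors=None):
    """Return {compound: amount relative to the internal standard}.

    peak_areas: {compound: integrated area of its ion trace}.
    standard_area: integrated area of the labeled internal standard.
    response_factors: optional {compound: response factor}; a compound
        without an entry is treated as having a response factor of 1.0.
    """
    if standard_area <= 0:
        raise ValueError("internal standard area must be positive")
    response_factors = response_factors or {}
    return {
        compound: (area / standard_area) / response_factors.get(compound, 1.0)
        for compound, area in peak_areas.items()
    }


# Hypothetical areas in arbitrary units.
amounts = relative_amounts(
    {"compound_x": 200.0, "compound_y": 50.0},
    standard_area=100.0,
    response_factors={"compound_y": 0.5},
)
```

Dividing by the internal standard area compensates for injection-to-injection variation, which is why the same standard is carried through every sample.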

[0219] GC-MS Procedures

[0220] Approximately 10 mg of dried ground plant tissue samples in 96-well plates were extracted and derivatized in-situ. The procedure yielded trimethylsilyl (TMS) derivatives for a variety of compounds including organic acids, fatty acids, amino acids, sugars, alcohols, and sterols. The procedure involved a two-step derivatization using MSTFA (methyl trimethylsilyl trifluoroacetamide) in acetonitrile, acidified with trifluoroacetic acid, followed by derivatization with a strongly basic silylating agent such as TMSDMA (trimethylsilyldimethylamine). TMS derivatives were analyzed by gas chromatography with time-of-flight mass spectrometry (GC/TOF-MS). Separations were conducted using a 50% phenyl-50% methyl stationary phase, helium carrier gas, and a programmed oven temperature that ramped from a starting temperature of 50° C. to a final temperature of over 300° C. Compounds detected by GC-MS with an electron impact (EI) ion source were cataloged based on Kovats retention indices and mass-to-charge ratio (m/z) of the ions characteristic of each peak. Isotopically labeled internal standards were measured and system suitability checks were performed both prior to and throughout sample analyses, assuring that instrument response remained within statistically derived limits of the initial calibration responses.

[0221] ICP-MS Procedures

[0222] Approximately 10 mg of plant tissue samples were digested with 1 mL of aqua regia by overnight digestion at 60° C. Samples were passed through 45 μm glass fiber filters, diluted as needed, and analyzed on a Micromass Platform ICP-MS (Waters Corp., Beverly, Mass.) with a LEAP CTC PAL autosampler (LEAP Technologies, Inc., Carrboro, N.C.). System suitability checks were performed both prior to and during sample analyses.

[0223] Characterization and/or Identification of Compounds Present in Plant Tissue

[0224] Control plant tissue samples were analyzed repeatedly by each spectral methodology as described above to determine statistically significant baselines. The resulting data were processed for characterization of all possible peaks and entered into the database of endogenous metabolites. In most cases the raw data were processed using a deconvolution algorithm and the peaks present were characterized with retention times/indices and relative mass intensities. The spectral data characteristics corresponding to the peak list were compared to those for the existing metabolite database and the peaks corresponding to known compounds were identified. For the peaks routinely found in the plant samples, but not corresponding to an identified compound, the compound formulas representing the spectral data characteristics with the highest probability were entered into the database of endogenous metabolites. The compounds indicated as corresponding to the characterized but unidentified peaks were linked to metabolic reaction(s)/pathway(s) and the identities of the compounds associated with the pathways of greatest interest were validated (see FIG. 3). A LECO Pegasus II GC/TOF-MS (LECO Corp., St. Joseph, Mich.) and a ThermoFinnigan ion trap GC-MS (PolarisQ) (Thermo Finnigan Corp., San Jose, Calif.) were used in conjunction with additional detector systems, such as an atomic emission detector (AED) and an infrared (IR) detector, for validation of compound identity. A list of compounds present in the database of endogenous metabolites is set forth in Table 2.
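One simple way to compare a peak list against an existing metabolite database, as described above, is a tolerance match on retention index and characteristic m/z. The Python sketch below is illustrative only: the tolerance values, the library entries, and the nearest-retention-index tie-break are assumptions, not parameters taken from the experiments.

```python
def match_peaks(peaks, library, ri_tolerance=5.0, mz_tolerance=0.5):
    """Assign each (retention_index, mz) peak to the closest library compound.

    library: {compound name: (retention_index, mz)}.
    Returns one entry per peak: the matched name, or None for a peak that is
    characterized but unidentified.
    """
    assignments = []
    for retention_index, mz in peaks:
        best_name, best_delta = None, None
        for name, (library_ri, library_mz) in library.items():
            if abs(mz - library_mz) > mz_tolerance:
                continue
            delta = abs(retention_index - library_ri)
            if delta <= ri_tolerance and (best_delta is None or delta < best_delta):
                best_name, best_delta = name, delta
        assignments.append(best_name)
    return assignments


# Hypothetical library entries and observed peaks.
LIBRARY = {"compound_1": (1100.0, 174.0), "compound_2": (1450.0, 217.0)}
OBSERVED = [(1102.0, 174.1), (1300.0, 205.0)]
assignments = match_peaks(OBSERVED, LIBRARY)
```

Peaks that return `None` correspond to the routinely observed but unidentified peaks, which would then be carried forward for formula assignment and validation.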

SPECIFIC EXAMPLE 2 Creation of a Coherent Data Set for Grouping Herbicides by Site of Action

[0225] Described herein is an approach that integrates and standardizes three types of data: gene expression, metabolite (or biochemical) data, and phenotypic (or morphologic) data, to capture a larger share of cellular information than that which is otherwise available from collective results of the three data types. The resulting coherent data set was applied to the grouping of herbicides by site of action (SOA) in Arabidopsis. Phenotypic, gene expression, and metabolite analyses were performed on Arabidopsis tissues treated with 18 herbicides having nine different sites of action (Table 3). Data types were standardized to allow for simultaneous testing of all the data types or any combination of data types. Data were tested for the ability to accurately indicate the grouping of the herbicides by common SOA. The results indicate that no individual or pair-wise combination of the data types yielded the predictive power achieved by combining all three data types into a coherent data set.

TABLE 2 List of Compounds in Metabolite Database

2,4,6TRIS(TRIFLUOROMETHYL)1,3,5-TRIAZ CHOLESTANE CAMPESTEROL 2,6-DIBUTYL-4-METHYLPYRIDINE CHOLESTENONE CHOLESTADIENE 2-ISOPROPYLMALIC ACID CHOLESTEROL HYDROXYBENZOIC ACID 2-KETOBUTYRIC ACID CHOLIC ACID HYPOXANTHINE 2-KETOGLUTARIC CHROMIUM INDIUM 2-PHENYL GLYCINE CINNAMIC ACID INDOLYLACETONITRILE 3,4-DIOH PHENYLALANINE CIS + TRANS EPOXYSUCCINIC INOSITOL ACID 3-NITRO-1,2,4-TRIAZOLE CIS-EPOXY SUCCINIC ACID IODINE 4-AMINOBENZOIC ACID CITRACONIC ACID IRON 4-AMINOBUTYRIC ACID CITRIC ACID ISOCITRIC ACID 4-FLUORO-L-PHENYLALANINE CITRIC ACID TRIMETHYLESTER ISOLEUCINE 4-OH PHENYL PYRUVIC CITRULLINE ITACONIC ACID 41K COBALT JASMONIC ACID 43CA CONIFERYL ALCOHOL KOJIC ACID 5-FLUOROINDOLE-2-CARBOXYLIC ACID COPPER L-ASPARTIC ACID 6-BENZYLAMINOPUR. RIBO CORTISONE L-PROLINE 7-METHOXY COUMARIN CARBOYXLIC ACID CYSTATHIONINE L-RIBULOSE HYDRATE ACETYL GIBBERELLIC ACID CYSTEINE LANOSTEROL ACIFLUORFEN CYTOSINE LAURIC ACID ACTINONIN DECANOIC ACID LEAD ADENINE DIAMINOPIMELIC ACID LEUCINE ADENOSINE DICYSTEINE LEUCINE/ISOLEUCINE ADENOSINE 5′DI PO4 DIHYDROCHOLESTEROL LITHIUM ALANINE DIHYDROXYACETONE PO4 LUPEOL DIMETHYL KETAL ALLANTOIC ACID DIOSGENIN LUTEOLIN ALLANTOIN DIPICOLINIC ACID LYSINE ALLUMINUM DOCOSANOIC ACID MAGNESIUM AMINOADIPIC ACID EICOSANOIC ACID MALIC ACID ANTHRANILIC ACID ERGOCALCIFEROL MANGANESE ANTHRONE ERGOSTEROL MERCURY ANTIMONY ESTRONE METHIONINE ARGININE FARNESOL METHYL STEARATE ARSENIC FLUORESCAMINE METRIBUZIN ASCORBIC ACID FLUORESCEIN MEVALONIC LACTONE ASPARAGINE FOLIC ACID MOLYBDENUM ASPARTIC ACID FRUCTOSE MYRCENE BARIUM FUMARIC ACID N-C10 BENZOIC ACID GALLIC ACID N-C12 BERYLLIUM GIBBERELLIC ACID N-C14 BETAINE GLUCOSE N-C16 BIOTIN GLUTAMIC ACID N-C18 BISMUTH GLUTAMINE N-C20 BIURET GLUTATHIONE N-C22 BORON GLYCINE N-C24 BRASSICASTEROL HISTIDINE N-C26 CADMIUM HOMOCYSTEINE N-C28 CAFFEINE HOMOGENTISIC ACID CALCIUM HOMOSERINE N-C31 STRONTIUM N-C32 N-C34 HYDROCORTISONE SUCROSE N-C36 SULFOLANE N-C38 SYNEPHRINE N-C40 TAURINE NAPTHOL TETRADECANOIC ACID NEROL THREONINE NIACINAMIDE THYMINE NICKEL TIN NICOTINIC ACID TMS-PHOSPHATE NOPALINE TRYPTOPHAN OCTADECADIENOIC ACID TYROSINE OCTADECANOIC ACID UNKNOWN OCTADECATRIENIOC ACID URACIL ORNITHINE URANIUM OROTIC ACID URIC ACID OXALIC ACID DIMETHYL ESTER UROCANIC ACID OXALOACETIC ACID URSOLIC ACID PALMITIC ACID VALINE PANTOTHENIC ACID VANADIUM PHENYL PYRUVIC ACID ZEATIN PHENYLALANINE ZINC PHOSPHATE a-TOCOPHEROL PHOSPHOENOLPYRUVATE g-TOCOPHEROL PHOSPHORUS g-TOCOPHEROL(un) PINITOL o-COUMARIC ACID PIPECOLIC ACID p-COUMARIC ACID POTASSIUM SUCCINIC ACID PROGESTERONE STIGMASTEROL METHYL ESTER PROLINE STEARIC ACID PROTEIN STIGMASTEROL PYRIDOXINE N-C29 PYRUVIC ACID N-C30 QUINIC ACID SQUALENE QUINIC ACID 1,3,4,5R SHIKIMIC ACID RAFFINOSE SILVER RETINOIC ACID SINAPINIC ACID RIBOFLAVIN SITOSTEROL RIBOSE SALICYLIC ACID SELENIUM SERINE

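[0226] TABLE 3 Herbicides Grouped According to Site of Action

No. | Chemical | Chemical Family | Site of Action | Suggested MOA | Symptom Class
1 | Glyphosate | | 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS) | reduced photosynthetic intermediates via loss of feedback regulation | 4
2 | Glufosinate | - | glutamine synthetase | accumulation of ammonia | 3
3 | Acifluorfen | diphenylether | protoporphyrinogen oxidase (protox) | lipid peroxidation | 6
4 | Bifenox | diphenylether | protoporphyrinogen oxidase (protox) | lipid peroxidation | 6
5 | Imazapyr | imidazolinone | ALS | depletion of ile, leu, val? | 4
6 | Imazethapyr | imidazolinone | ALS | depletion of ile, leu, val? | 4
7 | Chlorsulfuron | sulfonylurea | ALS | depletion of ile, leu, val? | 4
8 | Atrazine | triazine | Qb binding protein | lipid peroxidation | 7
9 | Metribuzin | triazine | Qb binding protein | lipid peroxidation | 7
10 | Diuron | phenylurea | Qb binding protein | lipid peroxidation | 7
11 | Bentazon | benzothiadiazole | Qb binding protein | lipid peroxidation | 7
12 | Paraquat | bipyridinium | accepts electrons from photosystem I | lipid peroxidation | 7
13 | Diquat | bipyridinium | accepts electrons from photosystem I | lipid peroxidation | 7
14 | 2,4-D | phenoxy acetic acid | unknown | auxin-like | 5
15 | Dicamba | benzoic acid | unknown | auxin-like | 5
16 | Benazolin | - | unknown | auxin-like | 5
17 | Amitrole | - | unknown (carotenoid biosynthesis) | unknown | 2
18 | Metolachlor | chloroacetamide | unknown (very long chain fatty acids?) | unknown | 7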

[0227] The herbicide SOA study, also referred to as SOA1, was performed according to the procedures described below.

[0228] Herbicide Treatment

[0229] Arabidopsis thaliana plants were grown for 21 days and herbicides were applied by spraying the foliage in a spray hood (Halltech Environmental, Inc, Guelph, Ontario). Herbicide stock solutions were made in dimethylsulfoxide (DMSO). Working solutions were made by diluting the stock solutions into 15% DMSO or 20% tetrahydrofurfuryl alcohol, while the negative control contained a corresponding solution lacking herbicide. The minimum inhibitory concentration (MIC) was defined as the minimum concentration of herbicide that inhibited rosette growth by at least 90% compared to mock treated control plants. The time required for plants to exhibit the full range of symptoms at the minimum inhibitory concentration of herbicide (Tmic) was measured. MIC and Tmic were determined from rosette measurements made every 3 days and daily photographs of plants sprayed with a series of two-fold dilutions. For each herbicide, treated and control plant tissue samples were harvested at 10%, 30%, and 70% of Tmic. A separate flat of plants (approximately 30) was used for each herbicide-treated and mock-treated group at each of the 10%, 30%, and 70% time points.
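The MIC definition and the harvest schedule above reduce to two small calculations, sketched here in Python. The dilution series, growth values, and function names are hypothetical and serve only to make the 90% inhibition criterion and the 10%, 30%, and 70% of Tmic harvest points concrete.

```python
def minimum_inhibitory_concentration(rosette_growth, control_growth, threshold=0.90):
    """Return the lowest tested concentration that inhibits growth by >= threshold.

    rosette_growth: {concentration: rosette growth of treated plants}.
    control_growth: rosette growth of mock-treated control plants.
    Returns None if no tested concentration reaches the threshold.
    """
    if control_growth <= 0:
        raise ValueError("control growth must be positive")
    inhibitory = [
        concentration
        for concentration, growth in rosette_growth.items()
        if 1.0 - growth / control_growth >= threshold
    ]
    return min(inhibitory) if inhibitory else None


def harvest_times(tmic, fractions=(0.10, 0.30, 0.70)):
    """Return the harvest times corresponding to the given fractions of Tmic."""
    return [fraction * tmic for fraction in fractions]


# Hypothetical two-fold dilution series: concentration -> rosette growth.
series = {8.0: 0.2, 4.0: 0.4, 2.0: 0.5, 1.0: 4.0, 0.5: 8.0}
mic = minimum_inhibitory_concentration(series, control_growth=10.0)
times = harvest_times(tmic=10.0)
```

With a control growth of 10.0, the three highest concentrations in this toy series inhibit growth by at least 90%, so the function selects the lowest of them as the MIC.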

[0230] Sample Preparation

[0231] Plant tissue was harvested directly into bar-coded tubes (96-well format) in liquid nitrogen, lyophilized, ground to powder, and stored according to the procedures described in Specific Example 1. For GC-MS, LC-MS, and ICP-MS analysis, the ground plant tissue was dispensed into 96-well plates as described in Specific Example 1, supra.

[0232] GC-MS, LC-MS, and ICP-MS Analysis Procedures

[0233] Each of the plant tissue samples was analyzed by GC-MS, LC-MS, and ICP-MS in a 96-well high-throughput format according to the procedures described in Specific Example 1, supra. Sample ID and all associated data were linked through LIMS. The instrumentation used for analysis was validated to ensure the reproducibility and reliability of data collected and processed in the platform.

[0234] Error models describing the calibration and validation of the instrumentation were constructed to describe the properties of sample behavior. BEEBE ET AL., CHEMOMETRICS: A PRACTICAL GUIDE 348 (1998). The reliability and sensitivity of the high-throughput analytical techniques (GC-MS, LC-MS, HPLC, ICP) used in the present invention have been previously demonstrated. Fiehn et al., Metabolite Profiling for Plant Functional Genomics, 18 NATURE BIOTECH. 1157-1161 (2000). The range of detection and the high-throughput nature of the metabolite analysis affected the statistical treatment of the response data. The variance across a 96-well plate was measured to allow for the use of a single replicate injection for each sample. The instrumentation used was qualified for a single replicate injection according to the procedures described as follows. The instrument qualification study was a randomized, parallel assignment of at least three known compounds at three concentrations with a minimum of 12 randomized injections for each compound-concentration combination. A total of 108 injections were used for a complete 96-well study. The variance across a 96-well plate was estimated in this manner. MILLER & MILLER, STATISTICS FOR ANALYTICAL CHEMISTRY 227 (2d. ed., 1988). The minimum number of replicates required to achieve a power of 0.90, at a significance testing level of 0.05, was estimated for a two-tailed analysis of variance test according to Sokal and Rohlf. SOKAL & ROHLF, BIOMETRY: THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH 887 (3d. ed., 1995).
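The injection count stated above follows from the design: three compounds at three concentrations with 12 injections per combination gives 3 × 3 × 12 = 108 injections. A randomized run order for such a design could be generated as sketched below; the compound and concentration labels, the function name, and the fixed random seed are assumptions made so that the example is reproducible.

```python
import itertools
import random
from collections import Counter


def randomized_injection_plan(compounds, concentrations, replicates=12, seed=0):
    """Return a shuffled list of (compound, concentration) injections.

    Each compound-concentration combination appears exactly `replicates` times.
    """
    plan = [
        (compound, concentration)
        for compound, concentration in itertools.product(compounds, concentrations)
        for _ in range(replicates)
    ]
    random.Random(seed).shuffle(plan)  # seeded for a reproducible run order
    return plan


# Hypothetical labels for the three standards and three concentration levels.
plan = randomized_injection_plan(
    ["standard_1", "standard_2", "standard_3"], ["low", "mid", "high"]
)
counts = Counter(plan)
```

Shuffling the complete list, rather than randomizing within each compound, spreads every combination across the whole run so that drift over time is not confounded with any one standard.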

[0235] In the case of LC-MS, a plurality of peaks (up to 300) was detected in both positive and negative mode in the control samples. The ions were likely due to (M+H)⁺ or (M+NH4)⁺ for positive mode and (M−H)⁻ or (M−OAC)⁻ for negative mode. Exact molecular weights were calculated using previously assigned peaks. Mass spectrum profiles were evaluated for isotopic distribution primarily due to ¹³C contributions, and the most likely elemental composition was computed using the nitrogen rule, isotopic ratio contributions, and scanning of molecular weight libraries. All spectral data were entered into the database of endogenous metabolites as described in Specific Example 1, supra.
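The ion assignments and composition checks described above can be illustrated with a short Python sketch. Several assumptions are made: the mass shifts are rounded monoisotopic values for singly charged ions, the acetate-containing negative ion is treated as an acetate adduct of the neutral molecule, the ¹³C contribution is taken as roughly 1.1% of the monoisotopic peak per carbon atom, and the nitrogen rule is applied in its simple form for common organic compounds. None of these values or function names are taken from the experiments.

```python
# Shift in Da from the neutral monoisotopic mass M to the observed ion.
# Rounded reference values; verify before any analytical use.
POSITIVE_ADDUCTS = {"[M+H]+": 1.007276, "[M+NH4]+": 18.033823}
NEGATIVE_ADDUCTS = {"[M-H]-": -1.007276, "[M+CH3COO]-": 59.013851}


def candidate_neutral_masses(mz, mode):
    """Return {adduct: neutral mass implied if the ion at mz were that adduct}."""
    if mode == "positive":
        adducts = POSITIVE_ADDUCTS
    elif mode == "negative":
        adducts = NEGATIVE_ADDUCTS
    else:
        raise ValueError("mode must be 'positive' or 'negative'")
    return {adduct: mz - shift for adduct, shift in adducts.items()}


def estimate_carbon_count(m_plus_1_ratio, contribution_per_carbon=0.0108):
    """Estimate the number of carbons from the (M+1)/M intensity ratio."""
    return round(m_plus_1_ratio / contribution_per_carbon)


def odd_nitrogen_count_expected(nominal_mass):
    """Nitrogen rule: an odd nominal neutral mass suggests an odd number of N atoms."""
    return nominal_mass % 2 == 1


candidates = candidate_neutral_masses(181.070666, "positive")
carbons = estimate_carbon_count(0.065)
```

Each candidate neutral mass would then be compared against molecular weight libraries, with the carbon estimate and the nitrogen rule used to discard implausible elemental compositions.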

[0236] GC-MS analysis of plant tissue samples was conducted using a ThermoFinnigan Tempus GC/TOF-MS system (Thermo Finnigan Corp., San Jose, Calif.) including a small bore, capillary column (≤0.18 mm ID) with a high temperature 50% phenyl stationary phase. Column temperature was programmed to ramp from an initial temperature of 50° C. to over 300° C. Column effluent passed through a heated transfer line into a time of flight mass spectrometer equipped with an electron impact ion source. Calibration of the mass scale on the TOF-MS was performed with perfluorotributylamine (FC-43, PFTBA). Detector linearity was confirmed using a paraffin mix at three different concentrations. Retention times and chain lengths of the various hydrocarbons in the paraffin mix were also used to generate Kovats retention indices.
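By way of illustration, a retention index may be computed from the paraffin standards as sketched below. The source does not state which form of the index was used. Because the column temperature was ramped, this sketch assumes the linear (temperature-programmed) form, I = 100 × [n + (t_x − t_n)/(t_{n+1} − t_n)], rather than the logarithmic isothermal form. The function name and the example retention times are hypothetical.

```python
from bisect import bisect_right


def retention_index(rt, alkanes):
    """Linear retention index of a peak eluting at retention time `rt`.

    `alkanes` is a list of (carbon_number, retention_time) pairs for the
    n-alkane (paraffin) standards, sorted by increasing retention time.
    """
    times = [t for _, t in alkanes]
    if rt < times[0] or rt > times[-1]:
        raise ValueError("retention time is outside the calibrated alkane range")
    # Index of the last standard eluting at or before rt, capped so that a
    # peak co-eluting with the final standard uses the last bracket.
    i = min(bisect_right(times, rt) - 1, len(alkanes) - 2)
    n_lo, t_lo = alkanes[i]
    n_hi, t_hi = alkanes[i + 1]
    return 100.0 * (n_lo + (n_hi - n_lo) * (rt - t_lo) / (t_hi - t_lo))


# Hypothetical standards: C10, C11, and C12 eluting at 5.0, 6.0, and 7.5 min.
example_alkanes = [(10, 5.0), (11, 6.0), (12, 7.5)]
example_index = retention_index(5.5, example_alkanes)
```

A peak eluting midway between the C10 and C11 standards receives an index midway between 1000 and 1100 under this form.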

[0237] Compounds detected in the plant tissue samples were cataloged based on Kovats retention indices and mass-to-charge ratio (m/z) of the ions characteristic of each peak. Typically, 50 to 100 major peaks were detected in the total ion chromatograms (TICs) for the plant samples. Over 200 peaks were detected by using deconvolution techniques or by manually selecting unique masses to isolate smaller peaks not readily observed in the TIC. All spectral data were entered into the database of endogenous metabolites as described in Specific Example 1, supra.

[0238] Total Protein Assay Procedures

[0239] Plant tissue samples prepared as described above were extracted according to the manufacturer's instructions (BCA-200 Protein Assay Kit, Pierce Biotechnology, Inc., Rockford, Ill.). Total protein assays were performed in a 96-well format using 10 μL tissue sample supernatant in accordance with the manufacturer's instructions.

[0240] Gene Expression Analysis Procedures

[0241] Arrays of 60mer oligonucleotide probes were manufactured by using non-contact inkjet microarray printing technology (Agilent Technologies, Palo Alto, Calif.). A total of 6200 A. thaliana genes were randomly selected. A number of genes were selected for randomized intra-array replication, and positive and negative control features were added, giving a total of 8400 features on the microarray. RNA was extracted from lyophilized and pulverized tissue using TRIZOL reagent (Invitrogen Corp., Carlsbad, Calif.). Lyophilized tissues were first re-hydrated using RNALATER (Ambion, Inc., Austin, Tex.). The mRNA in the total RNA sample was amplified, fluorescently labeled with either Cy3 (mock-treated) or Cy5 (herbicide treated), and hybridized against microarrays for 17 hours at 60° C. in accordance with the manufacturer's instructions (Agilent Technologies, Palo Alto, Calif.). Final samples contained 200 ng of each Cy-labeled cRNA. Arrays were washed in 6×SSC, 0.005% TRITON X-102 at 60° C., in the same solution for 10 minutes at room temperature, and in 0.1×SSC, 0.005% TRITON X-102 for five minutes at 4° C. The dried arrays were scanned using an Agilent LP2 Scanner (Agilent Technologies, Palo Alto, Calif.). Images were analyzed using software supplied by the manufacturer (Feature Extraction software, Agilent Technologies, Palo Alto, Calif.) and the resulting data files were evaluated using Rosetta RESOLVER software (Rosetta Inpharmatics, Inc., Kirkland, Wash.).

[0242] Experimental Design

[0243] Eighteen commercially available herbicides affecting nine distinct sites of action were studied using phenotypic, biochemical, and gene expression analysis (Table 3). Of the nine identified sites of action (SOA), five were represented by at least two herbicides. When available, different chemical classes of herbicides affecting a common site of action were utilized. Tissue was sampled at 10% (early), 30% (middle), and 70% (late) of the time required for the full development of symptoms at the MIC of herbicide. The phenotypic, gene expression, and biochemical responses of herbicide-treated plants were compared to mock-treated controls. Data derived from tissues treated with herbicides having an SOA with at least two representatives formed a training set, while data derived from the four remaining herbicides with distinct sites of action formed a test set. The objective was to find a method for accurately predicting grouping by SOA for both data sets.

[0244] Phenotypic Analysis

[0245] As shown in FIG. 12, seven distinct morphological phenotypes were observed for the 18 herbicides studied. For the phenotypic analysis, up to twelve traits were measured for each group of herbicide treated plants, and the data were expressed as numeric values standardized to the average response for the mock treated tissues (Table 4). The twelve traits measured were the following leaf characteristics for both new and old leaves: width, chlorosis, anthocyanin accumulation, necrosis, twisting, and curling. While phenotypic analysis indicated the accurate grouping by SOA for a majority of herbicides, in some cases very similar symptoms were observed for herbicides affecting distinct sites of action. For example, leaf bleaching and leaf enlargement were characteristic of the carotenoid inhibitor, amitrole. Chlorosis and leaf curling were characteristic of the glutamine synthetase inhibitor, glufosinate. Necrotic leaf flecks were characteristic of the protoporphyrinogen oxidase (PROTOX) inhibitors, bifenox and acifluorfen. The auxin inhibitors produced thin bent leaves often resembling a pinwheel. However, both the PSII (Photosystem II) (diuron, metribuzin, atrazine, and bentazon) and the PSI (Photosystem I) (paraquat and diquat) inhibitors caused rapid and widespread leaf necrosis presumably via a convergence in their lipid peroxidation-based mode of action. Similarly, both the acetolactate synthase (ALS) inhibitors (imazethapyr, imazapyr, chlorsulfuron) and the 5-enolpyruvylshikimate-3-phosphate synthase (EPSPS) inhibitor (glyphosate) caused anthocyanin accumulation in the older leaves accompanied by chlorosis of the newly emerging leaves. Phenotypic analysis alone was insufficient to distinguish the herbicides by SOA.

TABLE 4 Eleven Phenotypic Traits Measured for Each Herbicide Treated Group

No. | Herbicide | Trait | Dev.
1 | 2-4-D | leafWidth | −1
2 | 2-4-D | matureLeafChlorosis | 2
3 | 2-4-D | newLeafChlorosis | 1
4 | 2-4-D | matureLeafAnthocyanins | 1
5 | 2-4-D | newLeafAnthocyanins | 0
6 | 2-4-D | matureLeafNecrosis | 0
7 | 2-4-D | newLeafNecrosis | 0
8 | 2-4-D | leafCurling | 1
9 | 2-4-D | leafTwisting | 2
10 | 2-4-D | tMic | 2
11 | 2-4-D | pointedLeaves | 0
12 | Acifluor | leafWidth | 0
13 | Acifluor | matureLeafChlorosis | 0
14 | Acifluor | newLeafChlorosis | 0
15 | Acifluor | matureLeafAnthocyanins | 0
16 | Acifluor | newLeafAnthocyanins | 0
17 | Acifluor | matureLeafNecrosis | 1
18 | Acifluor | newLeafNecrosis | 3
19 | Acifluor | leafCurling | 1
20 | Acifluor | leafTwisting | 0
21 | Acifluor | tMic | 1
22 | Acifluor | pointedLeaves | 0
23 | Amitrole | leafWidth | 2
24 | Amitrole | matureLeafChlorosis | 3
25 | Amitrole | newLeafChlorosis | 4
26 | Amitrole | matureLeafAnthocyanins | 0
27 | Amitrole | newLeafAnthocyanins | 0
28 | Amitrole | matureLeafNecrosis | 0
29 | Amitrole | newLeafNecrosis | 0
30 | Amitrole | leafCurling | −1
31 | Amitrole | leafTwisting | 0
32 | Amitrole | tMic | 2
33 | Amitrole | pointedLeaves | 0
34 | Atrazine | leafWidth | −1
35 | Atrazine | matureLeafChlorosis | 1
36 | Atrazine | newLeafChlorosis | 1
37 | Atrazine | matureLeafAnthocyanins | 0
38 | Atrazine | newLeafAnthocyanins | 0
39 | Atrazine | matureLeafNecrosis | 4
40 | Atrazine | newLeafNecrosis | 4
41 | Atrazine | leafCurling | 1
42 | Atrazine | leafTwisting | 0
43 | Atrazine | tMic | 1
44 | Atrazine | pointedLeaves | 1
45 | Benazoli | leafWidth | −2
46 | Benazoli | matureLeafChlorosis | 0
47 | Benazoli | newLeafChlorosis | 0
48 | Benazoli | matureLeafAnthocyanins | 0
49 | Benazoli | newLeafAnthocyanins | 0
50 | Benazoli | matureLeafNecrosis | 0
51 | Benazoli | newLeafNecrosis | 0
52 | Benazoli | leafCurling | 2
53 | Benazoli | leafTwisting | 2
54 | Benazoli | tMic | 2
55 | Benazoli | pointedLeaves | 0
56 | Bentazon | leafWidth | −2
57 | Bentazon | matureLeafChlorosis | 2
58 | Bentazon | newLeafChlorosis | 2
59 | Bentazon | matureLeafAnthocyanins | 0
60 | Bentazon | newLeafAnthocyanins | 0
61 | Bentazon | matureLeafNecrosis | 4
62 | Bentazon | newLeafNecrosis | 4
63 | Bentazon | leafCurling | 2
64 | Bentazon | leafTwisting | 0
65 | Bentazon | tMic | 1
66 | Bentazon | pointedLeaves | 1
67 | Bifenox | leafWidth | 0
68 | Bifenox | matureLeafChlorosis | 0
69 | Bifenox | newLeafChlorosis | 0
70 | Bifenox | matureLeafAnthocyanins | 0
71 | Bifenox | newLeafAnthocyanins | 0
72 | Bifenox | matureLeafNecrosis | 1
73 | Bifenox | newLeafNecrosis | 3
74 | Bifenox | leafCurling | 1
75 | Bifenox | leafTwisting | 0
76 | Bifenox | tMic | 1
77 | Bifenox | pointedLeaves | 0
78 | Chlorsul | leafWidth | −1
79 | Chlorsul | matureLeafChlorosis | 2
80 | Chlorsul | newLeafChlorosis | 2
81 | Chlorsul | matureLeafAnthocyanins | 3
82 | Chlorsul | newLeafAnthocyanins | 0
83 | Chlorsul | matureLeafNecrosis | 0
84 | Chlorsul | newLeafNecrosis | 0
85 | Chlorsul | leafCurling | 1
86 | Chlorsul | leafTwisting | 1
87 | Chlorsul | tMic | 2
88 | Chlorsul | pointedLeaves | 0
89 | Dicamba | leafWidth | −2
90 | Dicamba | matureLeafChlorosis | 2
91 | Dicamba | newLeafChlorosis | 0
92 | Dicamba | matureLeafAnthocyanins | 0
93 | Dicamba | newLeafAnthocyanins | 0
94 | Dicamba | matureLeafNecrosis | 0
95 | Dicamba | newLeafNecrosis | 0
96 | Dicamba | leafCurling | 2
97 | Dicamba | leafTwisting | 2
98 | Dicamba | tMic | 2
99 | Dicamba | pointedLeaves | 0
100 | Diquat | leafWidth | −2
101 | Diquat | matureLeafChlorosis | 1
102 | Diquat | newLeafChlorosis | 1
103 | Diquat | matureLeafAnthocyanins | 0
104 | Diquat | newLeafAnthocyanins | 0
105 | Diquat | matureLeafNecrosis | 4
106 | Diquat | newLeafNecrosis | 4
107 | Diquat | leafCurling | 2
108 | Diquat | leafTwisting | 0
109 | Diquat | tMic | 2
110 | Diquat | pointedLeaves | 1
111 | Diuron | leafWidth | −2
112 | Diuron | matureLeafChlorosis | 2
113 | Diuron | newLeafChlorosis | 2
114 | Diuron | matureLeafAnthocyanins | 0
115 | Diuron | newLeafAnthocyanins | 0
116 | Diuron | matureLeafNecrosis | 4
117 | Diuron | newLeafNecrosis | 4
118 | Diuron | leafCurling | 1
119 | Diuron | leafTwisting | 0
120 | Diuron | tMic | 1
121 | Diuron | pointedLeaves | 1
122 | Glufosin | leafWidth | −2
123 | Glufosin | matureLeafChlorosis | 3
124 | Glufosin | newLeafChlorosis | 3
125 | Glufosin | matureLeafAnthocyanins | 0
126 | Glufosin | newLeafAnthocyanins | 0
127 | Glufosin | matureLeafNecrosis | 0
128 | Glufosin | newLeafNecrosis | 0
129 | Glufosin | leafCurling | 2
130 | Glufosin | leafTwisting | 1
131 | Glufosin | tMic | 1
132 | Glufosin | pointedLeaves | 1
133 | Glyphosa | leafWidth | 0
134 | Glyphosa | matureLeafChlorosis | 1
135 | Glyphosa | newLeafChlorosis | 2
136 | Glyphosa | matureLeafAnthocyanins | 3
137 | Glyphosa | newLeafAnthocyanins | 1
138 | Glyphosa | matureLeafNecrosis | 3
139 | Glyphosa | newLeafNecrosis | 0
140 | Glyphosa | leafCurling | 0
141 | Glyphosa | leafTwisting | 0
142 | Glyphosa | tMic | 2
143 | Glyphosa | pointedLeaves | 1
144 | Imazapyr | leafWidth | 0
145 | Imazapyr | matureLeafChlorosis | 0
146 | Imazapyr | newLeafChlorosis | 2
147 | Imazapyr | matureLeafAnthocyanins | 2
148 | Imazapyr | newLeafAnthocyanins | 0
149 | Imazapyr | matureLeafNecrosis | 0
150 | Imazapyr | newLeafNecrosis | 0
151 | Imazapyr | leafCurling | 0
152 | Imazapyr | leafTwisting | 0
153 | Imazapyr | tMic | 2
154 | Imazapyr | pointedLeaves | 0
155 | Imazetha | leafWidth | 0
156 | Imazetha | matureLeafChlorosis | 0
157 | Imazetha | newLeafChlorosis | 2
158 | Imazetha | matureLeafAnthocyanins | 3
159 | Imazetha | newLeafAnthocyanins | 0
160 | Imazetha | matureLeafNecrosis | 0
161 | Imazetha | newLeafNecrosis | 0
162 | Imazetha | leafCurling | 1
163 | Imazetha | leafTwisting | 1
164 | Imazetha | tMic | 2
165 | Imazetha | pointedLeaves | 0
166 | Metolach | leafWidth | −1
167 | Metolach | matureLeafChlorosis | 0
168 | Metolach | newLeafChlorosis | 0
169 | Metolach | matureLeafAnthocyanins | 0
170 | Metolach | newLeafAnthocyanins | 0
171 | Metolach | matureLeafNecrosis | 3
172 | Metolach | newLeafNecrosis | 3
173 | Metolach | leafCurling | 2
174 | Metolach | leafTwisting | 1
175 | Metolach | tMic | 2
176 | Metolach | pointedLeaves | 1
177 | Metribuz | leafWidth | −2
178 | Metribuz | matureLeafChlorosis | 2
179 | Metribuz | newLeafChlorosis | 2
180 | Metribuz | matureLeafAnthocyanins | 0
181 | Metribuz | newLeafAnthocyanins | 0
182 | Metribuz | matureLeafNecrosis | 4
183 | Metribuz | newLeafNecrosis | 4
184 | Metribuz | leafCurling | 1
185 | Metribuz | leafTwisting | 0
186 | Metribuz | tMic | 1
187 | Metribuz | pointedLeaves | 1
188 | Paraquat | leafWidth | −1
189 | Paraquat | matureLeafChlorosis | 1
190 | Paraquat | newLeafChlorosis | 1
191 | Paraquat | matureLeafAnthocyanins | 0
192 | Paraquat | newLeafAnthocyanins | 0
193 | Paraquat | matureLeafNecrosis | 4
194 | Paraquat | newLeafNecrosis | 4
195 | Paraquat | leafCurling | 2
196 | Paraquat | leafTwisting | 0
197 | Paraquat | tMic | 2
198 | Paraquat | pointedLeaves | 1

[0246] Gene Expression Analysis

[0247] Gene expression responses were measured for the plant tissues treated with each of the 18 herbicides and the average response calculated for each herbicide. The average response for each herbicide treatment was standardized to the average response for the respective mock treated tissue creating gene expression profiles for each of the 18 herbicide treatments at each of the three time points. The gene expression profiles for the herbicide treated tissues were based on significant changes in gene expression (generally greater than 2-fold) relative to control samples, for a plurality of genes (300 to 1000). The gene expression responses were expressed in units of standard deviations relative to the control mean.
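The standardization described above may be sketched as follows. The source states only that responses were expressed in units of standard deviations relative to the control mean. The use of the sample standard deviation of the mock-treated replicates as the denominator, the function name, and the example values are assumptions made for illustration.

```python
from statistics import mean, stdev


def standardized_response(treated, control):
    """Mean treated response expressed in control standard deviations.

    `treated` and `control` are replicate measurements of one gene (or one
    metabolite) in herbicide-treated and mock-treated tissue, respectively.
    """
    sd = stdev(control)  # assumption: sample SD of the mock-treated replicates
    if sd == 0:
        raise ValueError("control replicates have zero variance")
    return (mean(treated) - mean(control)) / sd


# Hypothetical replicate intensities for a single gene.
example_z = standardized_response(treated=[14.0, 16.0], control=[8.0, 10.0, 12.0])
```

In this hypothetical case the control mean is 10 with a standard deviation of 2, so a treated mean of 15 lies 2.5 standard deviations above the control mean.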

[0248] Herbicidal SOA was not readily deduced from examination of gene expression. For example, the SOA for three of the herbicides in the study is ALS, an enzyme used in the synthesis of isoleucine, leucine, and valine from pyruvate. ALS is part of a pathway consisting of eight genes, six of which were included on the array. Of the genes on the array, three were found to be significantly up-regulated in the gene expression profiles of the tissues treated with the ALS-targeting herbicides. Likewise, two herbicides used in the study target PROTOX, an enzyme utilized in heme biosynthesis. In the case of heme biosynthesis, 22 enzymes are known to convert glutamate to heme and chlorophyll. Genes encoding 10 of the 22 enzymes were on the array, and 3 of the 10 genes displayed two to three-fold decreased expression in the profiles of the tissues treated with the PROTOX-targeting herbicides. Thus, it is difficult to deduce SOA from the differential expression of a few genes in a profile containing hundreds, when just a subset of the genes in the target pathway is altered and many genes in other pathways show much greater fluctuations in expression. Experimental error and lack of accurate and comprehensive gene annotation further complicated the analysis.

[0249] Although the gene expression analysis failed to conclusively indicate herbicide SOA, the gene expression data were tested for ability to predict the grouping of herbicides by SOA. The data were analyzed for hierarchical clustering according to common changes in gene expression. Clustering was performed with SAS PROC CLUSTER (SAS Institute, Inc., Cary, N.C.), using agglomerative hierarchical clustering with Ward's minimum-variance method on standardized data, to adjust for different ranges of response. SAS PROC TREE (SAS Institute, Inc., Cary, N.C.), was used to produce dendrograms of SOA (see FIG. 13). The data were clustered on the set of genes observed in all herbicide treatment groups, as the clustering algorithm did not allow missing values.
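For readers without access to SAS, the idea behind agglomerative clustering with Ward's minimum-variance criterion can be sketched in plain Python as below. This is not the PROC CLUSTER implementation. It is a small, unoptimized illustration that assumes the input profiles are already standardized and contain no missing values. The function name and the example profiles are hypothetical.

```python
def ward_merges(profiles):
    """Agglomerative clustering using Ward's minimum-variance criterion.

    `profiles` is a list of equal-length numeric vectors (one per herbicide).
    Returns the merge history as (sorted member indices, merge cost) tuples,
    where the cost is the increase in within-cluster sum of squares.
    """
    # cluster id -> (member indices, centroid)
    clusters = {i: ([i], list(p)) for i, p in enumerate(profiles)}
    next_id = len(profiles)
    merges = []
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                members_a, cent_a = clusters[a]
                members_b, cent_b = clusters[b]
                dist2 = sum((x - y) ** 2 for x, y in zip(cent_a, cent_b))
                na, nb = len(members_a), len(members_b)
                cost = na * nb / (na + nb) * dist2  # Ward's merge cost
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        members_a, cent_a = clusters.pop(a)
        members_b, cent_b = clusters.pop(b)
        na, nb = len(members_a), len(members_b)
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(cent_a, cent_b)]
        clusters[next_id] = (members_a + members_b, centroid)
        merges.append((sorted(members_a + members_b), cost))
        next_id += 1
    return merges


# Hypothetical two-variable profiles forming two obvious groups.
example_merges = ward_merges([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
```

The merge history corresponds to the branching order of a dendrogram. At each step the pair of clusters whose union least increases the within-cluster sum of squares is joined.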

[0250] Similar to that observed for the phenotypic profiles, cluster analysis of the gene expression profiles failed to accurately group the herbicides by common SOA (see FIG. 13). In addition, the predicted clustering by gene expression changed with the time of tissue harvesting. Use of the middle time point data resulted in the accurate grouping of 4 of the 5 sites of action (represented by more than one herbicide). Only the grouping of the two PROTOX inhibitors was not indicated with the middle time point data. The late time point data were the least indicative of the SOA. The early and middle time point data resulted in the strongest clustering of the PSII and ALS inhibitors, whereas the middle and late time point data resulted in the best grouping of the auxin and PROTOX inhibitors.

[0251] In some cases the clustering between herbicides with differing sites of action was stronger than for herbicides with the same SOA. For example, diquat is a PSI inhibitor, whereas acifluorfen and bifenox are PROTOX inhibitors, and metolachlor is neither a PSI nor a PROTOX inhibitor (unpublished data). However, the gene expression profile correlation between metolachlor and diquat (r=0.569) and the correlation between metolachlor and bifenox (r=0.499) were both higher than the correlation of bifenox to acifluorfen (r=0.151), which have the same SOA.
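A profile-to-profile correlation of the kind reported above may be computed as sketched below. The source does not state which correlation coefficient was used. This sketch assumes the Pearson coefficient calculated over the genes present in both profiles. The function name and the example profiles are hypothetical.

```python
import math


def profile_correlation(profile_a, profile_b):
    """Pearson correlation of two profiles over the genes they share.

    Each profile maps a gene identifier to its standardized response.
    """
    shared = sorted(set(profile_a) & set(profile_b))
    if len(shared) < 2:
        raise ValueError("at least two shared genes are required")
    xs = [profile_a[g] for g in shared]
    ys = [profile_b[g] for g in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a profile is constant over the shared genes")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical profiles; gene "g4" is absent from the second and is ignored.
profile_x = {"g1": 1.0, "g2": 2.0, "g3": 3.0, "g4": 9.0}
profile_y = {"g1": 2.0, "g2": 4.0, "g3": 6.0}
example_r = profile_correlation(profile_x, profile_y)
```

Restricting the calculation to shared genes mirrors the constraint noted for the cluster analysis, which likewise could not accommodate missing values.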

[0252] In addition, herbicides of different chemical class but with a common site of action were accurately grouped by gene expression analysis in some cases, while herbicides of the same chemical class and common site of action were not. For example, the early and middle time point data indicated the correct grouping of the PSII and the ALS inhibitors represented by different chemical classes of herbicides. The PSII inhibitors consisted of the benzothiadiazole (bentazon), triazines (atrazine and metribuzin), and phenylurea (diuron) and the ALS inhibitors consisted of sulfonylurea (chlorsulfuron) and imidazolinones (imazapyr and imazethapyr). In contrast, clustering was not indicated at any time point for the two PROTOX inhibitors of the same chemical class (diphenylether). The results of the cluster analysis of the gene expression profile data indicate either the need for optimization of time of sampling or the limited utility of a single sampling point in predicting herbicide SOA.

[0253] Evidence for similarities in profiles based on mode of action (MOA) rather than SOA is less clear. The PSII, PSI, and PROTOX inhibitors have distinct sites of action but are thought to have a common mode of action (MOA) through the generation of reactive oxygen species that promote lipid peroxidation. DEVINE ET AL., PHYSIOLOGY OF HERBICIDE ACTION (1993). However, when the data for the herbicides were compared, strong clustering was observed at the early time point between the PSI inhibitors, bifenox (one of the PROTOX inhibitors), and metolachlor (unknown MOA), but the PSII inhibitors did not cluster with this group. At the latest time point, some clustering occurred between the PSII and PROTOX inhibitors, but not with the PSI inhibitors. Gene expression analysis alone was insufficient to distinguish the herbicides by SOA or MOA.

[0254] Biochemical (Metabolite) Profiling

[0255] The same samples subjected to gene expression analysis were also examined using biochemical, or metabolite, analysis. Biochemical responses were measured for the plant tissues treated with each of the 18 herbicides and the average response calculated for each herbicide. The average response for each herbicide treatment was standardized to the average response for the respective mock treated tissue creating biochemical profiles for each of the 18 herbicide treatments at each of the three time points. The biochemical profiles were expressed in units of standard deviations relative to the control mean (data not shown).

[0256] In general, the predictive power of the metabolite data displayed many of the limitations observed for the gene expression data. The lack of comprehensive peak identification prevented inference of SOA from the biochemical responses. The metabolite data were tested for ability to predict the grouping of herbicides by SOA. The data were analyzed for hierarchical clustering according to common changes in biochemicals. Clustering was performed with SAS PROC CLUSTER (SAS Institute, Inc., Cary, N.C.), using agglomerative hierarchical clustering with Ward's minimum-variance method on standardized data, to adjust for different ranges of response. SAS PROC TREE (SAS Institute, Inc., Cary, N.C.), was used to produce dendrograms (FIG. 13). The data were clustered on the set of biochemicals observed in all herbicide treatment groups, as the clustering algorithm did not allow missing values.

[0257] Similar to that observed for the phenotypic and gene expression data, cluster analysis of the metabolite data failed to accurately predict the grouping of the herbicides by common SOA (FIG. 13). In the case of the biochemical profile data, use of the late time point data for the cluster analysis resulted in the most accurate grouping of the herbicides by SOA and the early time point data were the least indicative of SOA. For the late time point data, three of the five sites of action (represented by more than one herbicide) were accurately grouped. None of the biochemical time point data indicated the grouping of the two PROTOX inhibitors, and the late time point biochemical data failed to cluster the two PSI inhibitors. Similar to that observed for the gene expression analysis, the correlation of the biochemical responses of herbicides having different sites of action was often greater than the correlation between the responses of herbicides having the same SOA. Clustering by MOA based on the biochemical responses was less clear than for SOA. The data indicate that biochemical analysis alone is insufficient to distinguish the herbicides by SOA or MOA.

[0258] Combination of Profiling Technologies

[0259] Neither phenotypic, gene expression, nor metabolite analysis alone is sufficient to infer herbicidal SOA. Using data from any single technology resulted in inaccurate groupings of the herbicides by SOA. As a result, the data from two and three of the technologies were combined and tested to determine whether analysis of the combined data would improve herbicide classification by SOA.

[0260] For the three different technologies, the data were first expressed as standardized differences from controls as described above. Each data point represents the distance or degree (in units of standard deviations) by which a particular observation on a treated sample differed from the corresponding observation on a control sample. To reduce the dimensionality of the data and to approximately weight equally the data from the three technologies, principal components analysis was performed separately on the phenotypic, biochemical, and gene expression profiles, using SAS PROC PRINCOMP (SAS Institute, Inc., Cary, N.C.). Gene expression and metabolite data were taken from the early and late time points, respectively. Principal components analysis was applied to balance the data, as gene expression profiling provides an order of magnitude more data points than biochemical profiling. The application ensured that the two platforms were given approximately the same weight in further analysis. The analysis procedure resulted in 45 principal components (17 from gene expression profiling, 17 from biochemical profiling, and 11 from phenotypic profiling). The expression of the phenotypic, gene expression, and biochemical profile data in a common unit system allowed for simultaneous testing of any subset or combination of the data by analysis methods such as cluster analysis, discriminant analysis, or correlation analysis.
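The balancing step may be illustrated with the following sketch, which extracts the first principal component of one data block so that each technology contributes a comparable number of derived variables. It is not PROC PRINCOMP. It uses power iteration on the covariance matrix as a simple stand-in, recovers only the first component, and is practical only for small blocks. The function name and the example data are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (loading vector, per-sample scores) for the first component.

    `rows` is a list of samples, each a list of standardized responses.
    Power iteration is used here purely for illustration.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Start vector; would fail only if exactly orthogonal to the component.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance in the block
        v = [x / norm for x in w]
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores


# Hypothetical block of three samples whose second variable is twice the first.
example_loadings, example_scores = first_principal_component(
    [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
)
```

Running the same procedure separately on each technology and concatenating the resulting scores per herbicide yields a combined table in which no single platform dominates merely because it measured more variables.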

[0261] To assess the ability to predict the accurate grouping of herbicides according to SOA, pairwise combinations of the principal component data from each technology were tested using correlation analysis (FIG. 8). The results of testing data from pairs of technologies, such as gene expression and biochemical profiles, phenotypic and biochemical profiles, and phenotypic and gene expression profiles, while more accurate than the predictions from any single technology, still failed to indicate the correct grouping of the herbicides by SOA.

[0262] In contrast, 100 percent accuracy in grouping of the herbicides by SOA resulted when the data from all three technologies were combined as a coherent data set (FIG. 14). The data in FIG. 14 were derived using discriminant analysis. The principal components for each technology were used to derive a linear discriminant rule using SAS PROC DISCRIM with equal priors. The four herbicides with either unknown or singular sites of action were used to form a test set, and the data for the other fourteen herbicides formed the training set (Table 3). The discriminant rule was derived on the training set only. Prior to application, the discriminant rule was validated on the test set. The rule correctly indicated that the test herbicides did not belong to any class of herbicide represented in the training set. The rule was cross-validated against the training set as follows: each herbicide was serially removed from the training set, a new rule was derived from the remaining data, and the removed herbicide was classified on the new rule. The cross-validation displayed 100 percent correct classification of the herbicides.
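The leave-one-out cross-validation described above may be sketched as follows. To keep the sketch self-contained, a nearest-centroid classifier is substituted for the linear discriminant rule of PROC DISCRIM. Only the hold-one-out procedure itself reflects the source. The function names, class labels, and example vectors are hypothetical.

```python
import math
from collections import defaultdict


def class_centroids(samples):
    """Mean vector of each class; `samples` is a list of (label, vector)."""
    groups = defaultdict(list)
    for label, vec in samples:
        groups[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in groups.items()
    }


def classify(vec, centroids):
    """Assign `vec` to the class with the nearest centroid (stand-in rule)."""
    return min(centroids, key=lambda label: math.dist(vec, centroids[label]))


def leave_one_out_error(samples):
    """Fraction of samples misclassified when each is held out in turn."""
    errors = 0
    for i, (label, vec) in enumerate(samples):
        remaining = samples[:i] + samples[i + 1:]
        # A new rule is derived without the held-out sample each time.
        if classify(vec, class_centroids(remaining)) != label:
            errors += 1
    return errors / len(samples)


# Hypothetical training data with two well-separated classes.
example_training = [
    ("ALS", [0.0, 0.0]), ("ALS", [0.0, 1.0]), ("ALS", [1.0, 0.0]),
    ("PSII", [10.0, 10.0]), ("PSII", [10.0, 11.0]), ("PSII", [11.0, 10.0]),
]
example_error = leave_one_out_error(example_training)
```

Because each class in the training set has at least two members, removing one herbicide still leaves its class represented when the rule is re-derived, which is what makes this form of cross-validation meaningful.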

[0263] Attempts to discriminate between different sites of action using the principal components from any one platform or any pair of platforms were less than 100 percent successful. For gene expression data alone, the error rates were 100 percent on cross-validation, 0 percent on test data. For metabolite data alone, the error rates were 93 percent on cross-validation, 0 percent on test data. For phenotypic data alone, the error rates were 0 percent on cross-validation, 25 percent on test data. Discriminant analysis on data from pairs of technologies had error rates ranging from 40 to 100 percent on cross-validation, and 0 percent error rate on test data.

[0264] This analysis shows that the 45 principal components derived from gene expression, biochemical, and phenotypic profiling are 100 percent accurate in distinguishing between herbicides with different sites of action. To visualize the results, a three-dimensional plot of the first principal components from the three platforms was made using DECISIONSITE software (Spotfire, Inc., Somerville, Mass.) (FIG. 14). FIG. 14 depicts the data in three dimensions where the first principal component of each profiling technology is represented on one axis. The principal components were used to derive a linear discriminant rule using SAS PROC DISCRIM with equal priors. The rule indicated 100% correct classification of the herbicides by SOA. FIG. 14 reveals that each SOA class is part of a discrete group, easily distinguishable from all other classes. (Note: The depiction of the FIG. 14 graph is, by necessity, dimensionally reduced for the purpose of visualization; resolution between herbicide classes is even greater than what is represented in FIG. 14 when all principal components are considered in three dimensions).

[0265] The results of the foregoing study show that it is possible to accurately predict the SOA of herbicides using a combination of technologies when the SOA is represented in an existing database. The superior predictive power of combining three disparate data sources relative to the use of one or even two sophisticated and high resolution profiling technologies was demonstrated. It follows that the strategy set forth herein, of standardizing and combining disparate data into coherent data sets for the analysis of biological samples (FIG. 10), will increase the predictive power of the analysis. The strategy is applicable to any experimental system and any data or technology, including alternatives not explored herein, such as protein expression and activity profiling.

SPECIFIC EXAMPLE 3 Herbicide Mode-of-Action Analysis

[0266] Herbicides have contributed extensively to increases in crop yield by eliminating or reducing the impact of competitive plant species. Although there are presently numerous registered compounds marketed in thousands of commercial products, there remains a need for new active herbicidal ingredients. Factors that contribute to the need for new active ingredients include the development of herbicide-resistant plant species and stricter regulations for reducing toxicological and environmental effects.

[0267] Understanding the mode-of-action and more specifically identifying the site- or pathway-of-action of existing and new herbicidal candidates is extremely valuable. Identification of the target(s) of a herbicidal compound prompts many options that may affect the decision for continued development of that compound. For example, if the target is not novel, continued work on the candidate compound may be stopped. Conversely, additional screening against the target may yield other novel herbicidal chemistries with more desirable traits (e.g., better efficacy, a more favorable environmental fate, and the like). Additionally, selectivity with respect to non-target organisms can be predicted by bioinformatic analysis.

[0268] In the instant specific example of the present invention (hereinafter MOA1), phenotypic, metabolite, and gene expression analyses were used to assess the effect of five unknown herbicidal compounds (Unknowns 1-5) on Arabidopsis thaliana. Plants were sprayed with recommended concentrations of each unknown compound and tissue samples were collected 20 and 60 minutes after exposure. Treated tissues were processed and subjected to gene expression and metabolite, or biochemical, profiling. In a similar fashion, samples were subjected to biochemical profiling from plants that had been sprayed with 18 commercially known herbicides. A subset of the samples sprayed with the commercially known herbicides was also analyzed by gene expression profiling. A set of plants treated with each compound was subjected to a series of phenotypic assessments five days after treatment. Finally, all unknown and a subset of commercial compounds were also analyzed using a fungal nutritional profiling platform.

[0269] The data were analyzed in several ways. First, the profiling results for each compound were examined individually. Next, within each technology or process (gene expression analysis, biochemical analysis, and phenotypic analysis), comparisons were made within the group of unknown compounds and with the group of commercially known compounds. The results from the fungal nutritional profiling were used to guide analysis of the gene expression and metabolite analysis data. The last step of the experiment was to combine the data sets from the three technologies (gene expression analysis, biochemical analysis, and phenotypic analysis) to perform a global analysis of the herbicidal compounds.

[0270] Development of Spraying Method and Formulation

[0271] Control studies were conducted to improve the efficacy of compound application and minimize compound utilization. First, standard methodologies for application of each herbicidal compound were modified to reduce the amount of compound required per sample. Second, compound formulation was modified to optimize plant response to the test compound while minimizing secondary effects.

[0272] Spraying Methods

[0273] Plants were grown under short day conditions for 39 days prior to spraying with various herbicides. Under these conditions, the whole rosette for each plant provides approximately 150 mg dry weight material for analysis. Whole rosette leaves from two to four plants were pooled for each sample to reduce the influence of biological variation. Plant samples were flash frozen in liquid nitrogen and stored at −80° C. until further use. Frozen leaf tissue was lyophilized and an aliquot of the lyophilized tissue (~10 to 25 mg) was used to extract total RNA as known in the art (see e.g., SAMBROOK ET AL., MOLECULAR CLONING (1989); AUSUBEL ET AL., (EDS.) CURRENT PROTOCOLS IN MOLECULAR BIOLOGY (1994)) and metabolites as described in Specific Examples 1 and 2, supra.

[0274] Each plant was sprayed with herbicide concentrations equivalent to the recommended dosage of application under field conditions. This was achieved by converting kg/ha dosage to mg/ml as follows:

[0275] 1 flat=32 plants=1352 cm²

[0276] 1 hectare (ha)=10,000 m²

[0277] Therefore, 1 plant=4.22×10⁻⁷ ha. 1.0 kg/ha requires 0.42 mg herbicide/plant. Thus, 1.0 kg/ha=0.5 ml per plant at 0.84 mg/ml.
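The conversion above can be reproduced with the short sketch below, which uses the flat area (1352 cm² for 32 plants) and the 0.5 ml per plant spray volume given in the text. The function and constant names are illustrative only. The unrounded results are approximately 0.4225 mg per plant and 0.845 mg/ml, in agreement with the figures of 0.42 mg and 0.84 mg/ml stated above.

```python
CM2_PER_HA = 1e8  # 1 ha = 10,000 m^2 = 1e8 cm^2
MG_PER_KG = 1e6


def mg_per_plant(dose_kg_per_ha, flat_area_cm2=1352.0, plants_per_flat=32):
    """Mass of herbicide (mg) one plant receives at a given field dose."""
    area_per_plant_ha = flat_area_cm2 / plants_per_flat / CM2_PER_HA
    return dose_kg_per_ha * area_per_plant_ha * MG_PER_KG


def spray_concentration_mg_per_ml(dose_kg_per_ha, spray_volume_ml=0.5):
    """Solution concentration that delivers the dose in the spray volume."""
    return mg_per_plant(dose_kg_per_ha) / spray_volume_ml


dose_mg = mg_per_plant(1.0)                      # about 0.4225 mg per plant
conc_mg_ml = spray_concentration_mg_per_ml(1.0)  # about 0.845 mg/ml
```

The same functions apply to any other field dose, for example the maximum field doses used for the commercial herbicides.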

[0278] For each compound, six plants were sprayed with 3 ml of solution. Two plants were harvested each at 20 minutes and 1 hour, while the remaining plants were maintained for phenotypic profiling.

[0279] Treatment of Arabidopsis with Unknown and Commercial Compounds

[0280] Five unknown compounds and 18 commercially known herbicides that belong to different chemical families were prepared in a solution containing 0.01% Tween 80 and 3.4% dimethylsulfoxide (DMSO). The 18 commercial herbicides represent 13 different modes-of-action based on the Herbicide Resistance Action Committee (HRAC) classification scheme and 17 different modes-of-action based on the Weed Science Society of America (WSSA) classification scheme (Table 5). Commercial herbicides were included in the study for validation and comparative analysis purposes. The control samples contained Tween 80 and DMSO only. All unknown compounds were sprayed at a concentration equivalent to 1.0 kg/ha. All commercial compounds were sprayed at maximum field dose (MFD) or at 1.0 kg/ha if MFD data were not available (Table 5). For each compound, six plants were sprayed using an artist airbrush at a rate of 0.5 ml/plant. At 1.0 kg/ha, the amount of unknown compound required to spray six plants was 2.54 mg, based upon two timepoints and two plants for assessment of symptomology.

TABLE 5 List of Commercial Herbicides

Active Ingredient | Mode of Action | Chemical Family | Conc. (kg/ha) | WSSA Group | HRAC Group
Chlorsulfuron | Inhibition of acetolactate synthase ALS | Sulfonylureas | 0.02 | 2 | B
Imazapyr | Inhibition of acetolactate synthase ALS | Imidazolinones | 1.70 | 2 | B
2,4-D | Action like indole acetic acid (synthetic auxins) | Phenoxy-carboxylic-acids | 1.00 | 4 | O
Atrazine | Inhibition of photosynthesis at photosystem II | Triazines | 4.00 | 5 | C1
Bentazon | Inhibition of photosynthesis at photosystem II | Benzothiadiazinone | 2.24 | 6 | C3
Butylate | Inhibition of lipid synthesis - not ACCase inhibition | Thiocarbamates | 4.00 | 8 | N
Glyphosate | Inhibition of EPSP Synthase | Glycines | 4.00 | 9 | G
Glufosinate | Inhibition of glutamine synthetase | Phosphinic acids | 1.70 | 10 | H
Amitrole | Bleaching: Inhibition of carotenoid biosynthesis (unknown target) | Triazoles | 2.00 | 11 | F3
Norflurazon | Bleaching: Inhibition of carotenoid biosynthesis at the phytoene desaturase step (PDS) | Pyridazinone | 4.00 | 12 | F1
Acifluorfen | Inhibition of protoporphyrinogen oxidase (PPO) | Diphenylethers | 0.42 | 14 | E
Metolachlor | Inhibition of cell division (Inhibition of VLCFAs) | Chloroacetamides | 4.00 | 15 | K3
Asulam | Inhibition of DHP (dihydropteroate) synthase | Carbamates | 3.00 | 18 | I
Naptalam | Inhibition of auxin transport | Phthalamates Semicarbazones | 4.00 | 19 | P
Isoxaben | Inhibition of cell wall (cellulose) synthesis | Benzamides | 1.20 | 21 | L
Paraquat | Photosystem-I-electron diversion | Bipyridyliums | 0.53 | 22 | D
Chlorpropham | Inhibition of mitosis/microtubule organisation | Carbamates | 2.00 | 23 | K2
Isoxaflutole | Bleaching: Inhibition of 4-hydroxyphenyl-pyruvate-dioxygenase (4-HPPD) | Isoxazoles | 1.00 | 28 | F2

[0281] Biochemical Profiling (or Metabolite Profiling): LC-MS Analysis

[0282] Lyophilized tissue was disrupted by grinding for 5 minutes at 1800 rpm using a grinder and stored in a controlled environment until further analysis. Approximately 10 mg of dried ground tissue was extracted in 0.5 ml 10% aqueous methanol containing isotopically labeled internal standards. The extract was centrifuged at 4000 rpm for 2 minutes, diluted with an equal volume of 50% aqueous acetonitrile (V/V), and transferred to a temperature-controlled autosampler (4° C.) of a HP1100 HPLC system (Agilent Technologies, Palo Alto, Calif.).

[0283] The sample was fractionated on a C18 HPLC column in an acetonitrile/water gradient containing 5 mM ammonium acetate. After chromatography, the sample was passed through a splitter and the split flow was infused to the turbo-ionspray ionization sources of two Mariner LC-time of flight mass spectrometers (PerSeptive Biosystems Inc., Framingham, Mass.). The ion sources were optimized to generate and monitor positive and negative ions respectively.

[0284] The Total Ion Chromatogram (TIC) of the metabolic profile was analyzed for metabolites with masses ranging from 80 to 900 Daltons (Da). The individual ion traces of the extracted mass chromatogram of the (M−H)⁻ (negative) and (M+H)⁺ (positive) ions were used for both calibration and quantification. Relative amounts of the compounds were obtained by determining the intensity and peak areas of individual ion traces. Isotopically labeled internal standards were used for peak area ratios, response factor, and normalization of data throughout the experiment.

[0285] GC-MS Analysis

[0286] Approximately 10 mg of dried ground tissue was extracted with 25% v/v N-methyl-N-trimethylsilyl-trifluoroacetamide (MSTFA) and 0.1% v/v trifluoroacetic acid in acetonitrile. Samples were derivatized in 50% N,N-Dimethyltrimethylsilylamine (TMS-DMA), 25% acetonitrile, and 25% 1,2-dimethoxyethane followed by addition of 1,4-Dioxane. Precipitates were removed by centrifugation and the supernatants were used for analysis.

[0287] Gas chromatography was performed on a ThermoFinnigan Trace2000 GC (Thermo Finnigan Corp., San Jose, Calif.) equipped with an autosampler and a split/splitless injection port. The gas chromatograph was coupled to a ThermoFinnigan Tempus time-of-flight mass spectrometer (Thermo Finnigan Corp., San Jose, Calif.) fitted with an electron impact (EI) ion source. Chromatographic separations were conducted using a 50% phenyl/50% methyl polysiloxane stationary phase, helium carrier gas, and a programmed oven temperature that ramped from a starting temperature of 50° C. to a final temperature of over 300° C. Analyses were conducted with 1 μL injection volumes in split mode with a split ratio of 50:1. Electron impact mass spectra were acquired at 70 eV, at a rate of 10 spectra/second, over the range m/z 41 to 640. Paraffins used as retention standards for calculating retention indices were prepared by diluting a Florida TRPH standard (Restek Corp., Bellefonte, Pa.) to a working concentration of 25 μg/mL each in methyl tert-butyl ether with 0.005% v/v tetramethylene sulfone as an internal standard.

[0288] Compounds detected by GC-MS were cataloged based on Kovats retention indices and mass-to-charge ratio (m/z) of the ions characteristic of each peak. The instrument response for each analytical peak was expressed as a relative response of the selected quantitation ion for that peak to the detector response for tetramethylene sulfone at m/z 120.
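The two quantities used to catalog GC-MS peaks can be illustrated as follows. This is a minimal sketch in Python and not the instrument software's procedure. It assumes the linear, temperature-programmed form of the retention index (interpolation between the paraffin standards that bracket the peak), which the text does not specify; the function names and the example retention times are hypothetical.

```python
from bisect import bisect_right


def linear_retention_index(rt, alkanes):
    """Retention index of a peak at retention time `rt`.

    `alkanes` is a list of (carbon_number, retention_time) pairs for the
    paraffin standards, sorted by retention time. ASSUMPTION: linear
    interpolation between bracketing standards, as is commonly done for
    temperature-programmed runs. `rt` must fall at or after the first
    standard and before the last one.
    """
    times = [t for _, t in alkanes]
    i = bisect_right(times, rt) - 1
    if i < 0 or i >= len(alkanes) - 1:
        raise ValueError("retention time is outside the range of the standards")
    (n_lo, t_lo), (n_hi, t_hi) = alkanes[i], alkanes[i + 1]
    return 100.0 * (n_lo + (n_hi - n_lo) * (rt - t_lo) / (t_hi - t_lo))


def relative_response(quant_ion_response, internal_standard_response):
    """Peak response relative to the internal standard response
    (tetramethylene sulfone at m/z 120 in the text)."""
    return quant_ion_response / internal_standard_response


# Hypothetical standards: C10 at 5.0 min, C12 at 7.0 min, C14 at 10.0 min.
standards = [(10, 5.0), (12, 7.0), (14, 10.0)]
print(linear_retention_index(6.0, standards))
print(relative_response(500.0, 2000.0))
```

Expressing each response as a ratio to the internal standard makes peaks comparable between injections even when absolute detector response drifts.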

[0289] Peak Characterization and Identification

[0290] For both GC-MS and LC-MS analysis, peaks present in Arabidopsis samples were characterized and/or identified: (1) Metabolites known to be of interest were run as standards so that the corresponding metabolites present in the tissue samples could be identified; and (2) Peaks which were observed to appear regularly and repeatedly in Arabidopsis tissue but not corresponding to an identified metabolite were characterized in terms of their spectral properties. These combined methods led to the characterization and/or identification of several hundred peaks in LC-MS and GC-MS together.

[0291] Gene Expression Profiling

[0292] RNA was extracted from lyophilized and pulverized tissue using TRIZOL reagent (Invitrogen Corp., Carlsbad, Calif.). Lyophilized tissues were first re-hydrated using RNALATER (Ambion, Inc., Austin, Tex.). Arrays of 60mer oligonucleotide probes were manufactured by Agilent Technologies using non-contact inkjet microarray printing technology (Agilent Technologies, Palo Alto, Calif.). A total of 22,000 A. thaliana genes were spotted onto the array. A number of genes were selected for randomized intra-array replication, and positive and negative control features were added. The mRNA in the total RNA sample was amplified, fluorescently labeled with either Cy3 or Cy5, and hybridized against microarrays as described by the manufacturer (Agilent Technologies, Palo Alto, Calif.). Arrays were scanned using a LP2 Scanner (Agilent Technologies, Palo Alto, Calif.). Images were analyzed using Feature Extraction software (Agilent Technologies, Palo Alto, Calif.). The resulting data files were evaluated using Rosetta RESOLVER software (Rosetta Inpharmatics, Inc., Kirkland, Wash.).

[0293] Phenotypic Profiling

[0294] Two plants from each treatment were maintained for phenotypic profiling. Images were taken daily for one week and then every other day for the following week. Eleven phenotypic characteristics (data not shown) were assessed at the time point showing maximal symptomology for each herbicide. The phenotypic scores were used for cluster analysis of unknown and commercial herbicides.

[0295] Fungal Nutritional Profiling

[0296] The inventors have developed a profiling process for chemical mode-of-action analysis utilizing the filamentous fungus, Magnaporthe grisea. Filamentous fungi have the ability to utilize numerous carbon and nitrogen sources and they can utilize many nutrients as supplements for auxotrophic requirements. These attributes are useful for examining the effects of chemicals on the growth of M. grisea under a variety of media conditions. Loss or gain of the ability to utilize a specific nutrient(s) in the presence of a test compound can provide valuable information relating to the pathways that are targeted by that compound. Because plants and filamentous fungi have many metabolic pathways in common, the results obtained from analysis in fungi can sometimes be used to predict the effect of the test compound on a plant.

[0297] Typically, candidate chemicals submitted for MOA analysis are not available in large quantities. To minimize the amount of a particular compound required for analysis, a tiered nutritional profiling analysis protocol has been developed in which several nutrients are combined into “pools” for testing. A positive result in one pool triggers deconvolution of that pool into sub-pools or individual nutrients for testing. Using this approach, the total number of growth tests can be reduced approximately five- to ten-fold as compared to testing all nutrients independently.
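The tiered pooling logic can be sketched as a recursive search that descends only into pools giving a positive result. The Python sketch below is illustrative only. The pool names and the amino acid sub-pools follow the groupings named in the tiers described later in this example, but the individual purine, pyrimidine, vitamin, and cofactor members are placeholders, and `rescues` is a hypothetical stand-in for an actual growth test.

```python
def leaves(node):
    """Flatten a pool (nested dict of sub-pools, or list of nutrients)."""
    if isinstance(node, dict):
        return [n for child in node.values() for n in leaves(child)]
    return list(node)


def deconvolute(node, rescues, log):
    """Test a pool and descend only if the pool gives a positive result.

    `rescues(nutrients)` stands in for one growth test and returns True
    when the pooled nutrients restore growth. Each test is appended to
    `log`. Returns the individual nutrients that test positive.
    """
    members = leaves(node)
    log.append(members)                      # one growth test performed
    if not rescues(members):
        return []
    if len(members) == 1:
        return members
    children = node.values() if isinstance(node, dict) else [[m] for m in members]
    hits = []
    for child in children:
        hits.extend(deconvolute(child, rescues, log))
    return hits


# Pool layout: sub-pool names follow the text; members marked as
# placeholders are NOT taken from the specification.
POOLS = {
    "amino acids": {
        "aromatic": ["phe", "tyr", "trp"],
        "sulfur containing": ["cys", "met"],
        "aliphatic/aliphatic hydroxy": ["gly", "ala", "val", "leu", "ile", "ser", "thr"],
        "basic+asn/pro": ["lys", "arg", "his", "asn", "pro"],
        "acidic+gln": ["asp", "glu", "gln"],
    },
    "purines and pyrimidines": ["adenine", "guanine", "cytosine", "uracil", "thymine"],  # placeholders
    "vitamins and cofactors 1": ["thiamine", "riboflavin", "niacin", "biotin"],          # placeholders
    "vitamins and cofactors 2": ["pyridoxine", "pantothenate", "folate", "inositol"],    # placeholders
}

# Hypothetical compound whose inhibition is relieved by a single nutrient.
rescuing = {"tyr"}
tests = []
hits = deconvolute(POOLS, lambda pool: any(n in rescuing for n in pool), tests)
print(hits, len(tests), "tests instead of", len(leaves(POOLS)))
```

In this toy layout a single rescuing nutrient is located with 13 growth tests rather than 33. The saving depends on how many pools respond, so the figure here should not be read as a measurement of the protocol's actual efficiency.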

[0298] The initial nutrient pool for the present experiments included amino acids, purines, pyrimidines, and various vitamins and cofactors. The growth conditions were designed to test for both auxotrophic requirements and utilization as nitrogen sources.

[0299] M. grisea spores were inoculated into minimal media with or without nutrient supplementation. Test compounds were added at the minimal inhibitory concentration (MIC) or at a relatively high dose if no growth inhibition was observed in the concentration range tested. Spore suspensions were aliquoted into microtiter plates and incubated for seven days at 25° C. Optical density (OD) measurements at 590 nm were taken daily during the incubation period. Supplemented and minimal media growth were compared to untreated controls for each test compound. A difference between the growth kinetics in control versus treatment indicated that a nutrient utilization pathway was affected. Continued deconvolution of the pools was performed as necessary to identify specific nutrient(s) contributing to the growth response observed.
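One way to compare daily OD590 growth curves between treated and control wells is to summarize each curve by its area and take the ratio. The Python sketch below is an assumption for illustration: the text does not state which growth metric was used, and the 0.8 threshold and all function names are hypothetical.

```python
def area_under_curve(od, days=None):
    """Trapezoid-rule area under a growth curve of OD590 readings.

    `days` defaults to one reading per day (0, 1, 2, ...).
    """
    if days is None:
        days = list(range(len(od)))
    return sum(
        (days[i + 1] - days[i]) * (od[i] + od[i + 1]) / 2.0
        for i in range(len(od) - 1)
    )


def relative_growth(treated_od, control_od):
    """Growth of treated wells as a fraction of the untreated control."""
    return area_under_curve(treated_od) / area_under_curve(control_od)


def is_remediated(minimal_ratio, supplemented_ratio, threshold=0.8):
    """True when growth is reduced in minimal media but restored with
    supplements. ASSUMPTION: the threshold is arbitrary and illustrative."""
    return minimal_ratio < threshold <= supplemented_ratio


# Hypothetical three-day readings.
control = [0.0, 1.0, 2.0]
treated_minimal = [0.0, 0.5, 1.0]
print(relative_growth(treated_minimal, control))
```

A ratio well below 1 in minimal media that returns toward 1 in a supplemented pool corresponds to the "remediated" outcome described for the tiered experiments.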

[0300] Phenotypic Profiling

[0301] Eleven phenotypic characteristics, identical to the ones listed in Table 6, were assessed for each of the five unknown compounds and the commercial herbicides sprayed with Tween 80. The results for the unknown compounds are shown in Table 6.

TABLE 6 Symptom Scores for the Five Unknown Compounds

| Cmpd | Leaf width | Mature leaf chlorosis | New leaf chlorosis | Mature leaf anthocyanins | New leaf anthocyanins | Mature leaf necrosis | New leaf necrosis | Leaf curling | Leaf twisting | Tmic^(b) | Pointed leaves |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Unknown 1 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| Unknown 2 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 2 | 0 |
| Unknown 3 | 0 | 0 | 0 | 0 | 0 | 3 | 2 | 1 | 1 | 1 | 0 |
| Unknown 4 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 0 |
| Unknown 5 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 2 | 0 |

[0302] Hierarchical cluster analysis of the eleven phenotypic characteristics was used to visualize the relationship of the five unknown compounds to the commercial herbicides using Ward's method in Spotfire DecisionSite 7.0 (Spotfire, Inc., Somerville, Mass.). As expected, inhibitors of photosynthesis machinery and protoporphyrinogen oxidase clustered together, as did both of the ALS inhibitors. The bleaching herbicides also clustered closely although both glyphosate and glufosinate clustered with amitrole. This is consistent with the observation that amitrole exhibited chlorosis and not true bleaching. Unknown 1 clustered with carotenoid biosynthetic inhibitors, which result in a bleaching phenotype. Unknown 4 showed a strong chlorotic phenotype and did not group in the glyphosate/glufosinate/amitrole clade known to induce necrosis. Unknown 2, Unknown 3, and Unknown 5 grouped in a cluster containing commercial compounds that did not show strong phenotypes under our conditions.

[0303] Biochemical Profiling

[0304] A combined total of 716 peaks from the LC-MS (positive and negative modes) and GC-MS were examined for each treatment and time point. In the 20 minute and 1 hour time point data, a total of 168 and 176 peaks, respectively, were determined to be significantly different from the control (p<0.11) in at least one of the treatments. Of these, 69 and 78 peaks, respectively, could be identified as a specific metabolite. The number of metabolites whose abundance was significantly altered in the treated samples relative to the control samples is shown in Table 7.

TABLE 7 Regulated Metabolites Following Herbicide Treatment

Number of Metabolites Changed: Total # (unknown #)

| Compound | 20 min, p < 0.05 | 20 min, p < 0.11 | 1 hr, p < 0.05 | 1 hr, p < 0.11 |
|---|---|---|---|---|
| Unknown 1 | 7(5) | 13(8) | 15(12) | 49(29) |
| Unknown 2 | 4(2) | 6(4) | 4(2) | 23(12) |
| Unknown 3 | 4(2) | 9(6) | 4(3) | 10(9) |
| Unknown 4 | 5(2) | 8(3) | 20(10) | 47(24) |
| Unknown 5 | 5(3) | 12(9) | 2(1) | 6(5) |
| 2,4-D | 3(2) | 13(8) | 7(3) | 25(15) |
| Acifluorfen | 17(11) | 32(20) | 19(13) | 31(22) |
| Amitrole | 9(6) | 17(13) | 14(8) | 32(20) |
| Asulam | 9(3) | 14(5) | 10(7) | 18(11) |
| Atrazine | 4(4) | 11(8) | 17(8) | 49(24) |
| Bentazon | 9(3) | 10(6) | 11(9) | 19(14) |
| Butylate | 17(10) | 33(21) | 18(12) | 31(22) |
| Chlorpropham | 12(8) | 17(10) | 12(5) | 20(10) |
| Chlorsulfuron | 9(4) | 15(6) | 5(3) | 23(13) |
| Glufosinate | 33(25) | 48(33) | 4(4) | 9(9) |
| Glyphosate | 4(2) | 13(9) | 16(9) | 46(24) |
| Imazapyr | 5(2) | 9(3) | 8(6) | 14(10) |
| Isoxaben | 26(12) | 45(21) | 25(16) | 55(33) |
| Isoxaflutole | 45(29) | 62(39) | 14(13) | 25(21) |
| Metolachlor | 38(25) | 54(34) | 18(11) | 45(22) |
| Naptalam | 28(11) | 39(16) | 13(9) | 46(23) |
| Norflurazon | 38(27) | 55(36) | 6(5) | 12(8) |
| Paraquat | 9(5) | 14(9) | 22(15) | 50(31) |

[0305] Since Unknown 4 treatment induced larger perturbations in the metabolite pool size, the data were sorted based on Unknown 4 results. Only two peaks (nLCcmpd2 and nLCcmpd229) were uniquely regulated by Unknown 4. In addition, the levels of three other peaks (palmitic acid, nLCcmpd59, and nLCcmpd77) were also observed to change in only one other treatment each (naptalam, paraquat and glyphosate, respectively). Four peaks (pLCcmpd71, pLCcmpd234, ornithine, and C18 fatty acids) were determined to be uniquely regulated by Unknown 1. The metabolites regulated by the other three unknown compounds were shared among several other treatments.

[0306] Numerous peaks were commonly regulated among a majority of the treatments. For example, sitosterol, octadecadienoic acid, mevalonate lactone, pipecolic acid, ascorbic acid, indoleacetonitrile, and succinate were up-regulated in a variety of treatments. Data derived from plants subjected to various stresses suggested that plants induce changes in many of these metabolites as part of general stress response (unpublished). In addition to known metabolites, the regulation of a number of unidentified peaks was also shared among many treatments. Based on the similarity of the responses to the known metabolites, it is expected that the unidentified peaks may also be stress-related metabolites.

[0307] Treatment of plants with several other herbicides resulted in the perturbation of only a few putative stress-related metabolites. For example, neither butylate nor chlorpropham treatment resulted in many changes in these commonly regulated metabolites and neither showed a strong herbicidal phenotype. Only a few, if any, stress-related metabolites were observed with glufosinate, imazapyr, and norflurazon treatments. These observations may be explained by the slow development of symptoms for imazapyr and norflurazon, suggesting that responses to these herbicides may not be apparent in the first hour post-spraying. Similarly, it has been reported that glufosinate is also slow acting and poorly transported throughout the plant.

[0308] Data from the LC-MS and GC-MS platforms were combined for each time point and used for hierarchical cluster analysis. For each treatment, the response of each metabolite was converted to a standardized difference from control on a log scale. A subset of metabolites that showed differential expression (p<0.10) in at least one treatment was extracted. The principal components of this subset were calculated and used to cluster the biochemical profiling data.
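The first two preprocessing steps can be sketched as follows. This Python example is a hedged illustration: the text does not define the standardization, so the sketch assumes the difference of mean log10 responses divided by the standard deviation of the log-scale control replicates. The function names and input layout are hypothetical, and the principal component and clustering steps are not shown.

```python
import math
import statistics


def standardized_log_difference(treated, control):
    """Standardized difference from control on a log scale.

    ASSUMPTION: (mean log10 treated - mean log10 control) divided by the
    sample standard deviation of the log10 control replicates. At least
    two control replicates are required.
    """
    log_t = [math.log10(v) for v in treated]
    log_c = [math.log10(v) for v in control]
    return (statistics.mean(log_t) - statistics.mean(log_c)) / statistics.stdev(log_c)


def select_metabolites(p_values, alpha=0.10):
    """Keep metabolites with p < alpha in at least one treatment.

    `p_values` maps metabolite -> {treatment: p-value}.
    """
    return [
        metabolite
        for metabolite, by_treatment in p_values.items()
        if any(p < alpha for p in by_treatment.values())
    ]


# Hypothetical p-values for two peaks across two treatments.
example = {
    "peak_A": {"treatment_1": 0.04, "treatment_2": 0.60},
    "peak_B": {"treatment_1": 0.35, "treatment_2": 0.20},
}
print(select_metabolites(example))
```

Working on a log scale puts fold increases and fold decreases on a symmetric footing before the data are reduced to principal components.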

[0309] Clustering of biochemical profiling (BCP) data from the two time points yielded different results. However, for both time points the bleaching herbicides isoxaflutole and norflurazon, as well as glufosinate, clustered closely together, while amitrole and glyphosate, which grouped together with the bleaching herbicides phenotypically, were found in other areas of the dendrogram. In addition, the ALS inhibitors and the photosynthesis inhibitors, each of which clustered together phenotypically, did not group together at either time point. The relationships of the five unknown compounds to each other and to the commercial herbicides were different for each time point, although Unknown 1 and Unknown 4 remained in close proximity in both cases.

[0310] Due to the observation that the commercial herbicides with the same or similar modes-of-action did not cluster well in these experiments, a clear relationship of unknown compounds to the commercial herbicides or to each other cannot be gleaned from the present analyses. Factors that may have contributed to the results include: a) kinetics-of-action unaccounted for in each herbicide; and b) the low number of regulated metabolites in the samples (Table 7). For example, compounds with the same or similar MOAs may have different efficiencies for compound delivery to their target site. Additionally, the efficiency with which the compound inhibits the target may also vary. Thus, it is possible that some of the compounds may show more or less expression of metabolic changes depending on how rapidly they gain entry into the plant tissues and/or target organelles and how well they inhibit the target enzyme(s). Non-target effects within the plant cells may also contribute to variation seen between compounds with common MOAs. The results based purely on biochemical profiling data serve to illustrate the complexity involved when examining a biological system, and point to a need for an ability to collect and store large amounts of data which can be analyzed as one set. The methods of the present invention introduce a solution to the problem of storing and analyzing complex and comprehensive data sets that can serve as models of biological systems.

[0311] Gene Expression Analysis

[0312] Gene expression analysis was performed on the five unknown compounds and five commercial compounds at the one-hour time point. Two commercial herbicides were selected based on their phenotypic similarities with unknown compounds (isoxaflutole is similar to Unknown 1 and glufosinate is similar to Unknown 4), and three were identified as representative of diverse MOA compounds.

[0313] All gene expression experiments were performed with arrays containing 22,000 Arabidopsis genes. Each treatment was compared to a control sample and each experiment was repeated with cyanine dye swapping to eliminate dye detection biases. The resulting data were analyzed using Rosetta RESOLVER software (Rosetta Inpharmatics, Inc., Kirkland, Wash.). The total numbers of genes in each treatment that were down-regulated and up-regulated are shown in Table 8. In addition, the regulated genes for each treatment were compared to a list of “lethal” genes that have previously been identified (unpublished). A “lethal” gene is one without which a plant cannot survive, and so is a likely herbicide target.
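The reason a dye swap removes dye detection bias can be shown with a small calculation. The Python sketch below is a generic illustration of dye-swap averaging under the assumption of an additive, gene-specific dye bias on the log scale. It is not a description of how the RESOLVER software combines replicate arrays, and the function names are hypothetical.

```python
import math


def log_ratio(cy5, cy3):
    """log2 ratio of the Cy5 channel to the Cy3 channel for one feature."""
    return math.log2(cy5 / cy3)


def dye_swap_average(m_forward, m_swapped):
    """Combine a dye-swap pair of log ratios.

    m_forward: log2(Cy5/Cy3) with the treated sample labeled with Cy5.
    m_swapped: log2(Cy5/Cy3) with the treated sample labeled with Cy3.
    ASSUMPTION: an additive dye bias appears with the same sign in both
    arrays, so it cancels in the difference.
    """
    return (m_forward - m_swapped) / 2.0


# Hypothetical feature: true treated/control log2 ratio 1.0, dye bias 0.3.
true_ratio, bias = 1.0, 0.3
forward = true_ratio + bias      # treated in Cy5
swapped = -true_ratio + bias     # treated in Cy3
print(dye_swap_average(forward, swapped))
```

Because the bias enters both measurements with the same sign while the biological signal flips sign, half the difference recovers the treated-versus-control log ratio.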

[0314] The treatments resulting in the fewest gene expression perturbations were the commercial compounds, asulam and naptalam. All other treatments showed comparable levels of regulated genes except Unknown 4. Treatment with Unknown 4 resulted in nearly ten times as many perturbed genes as compared to the other treatments, indicating that Unknown 4 acts very rapidly within plant tissues.

TABLE 8 Regulated Genes Following Herbicide Treatment (1 hr, p < 0.5)

| Compound | Down | Up |
|---|---|---|
| Unknown 1 | 45 | 223 |
| Unknown 2 | 99 | 221 |
| Unknown 3 | 134 | 119 |
| Unknown 4 | 1866 | 1462 |
| Unknown 5 | 144 | 192 |
| Asulam | 7 | 80 |
| Chlorsulfuron | 109 | 97 |
| Glufosinate | 54 | 296 |
| Isoxaflutole | 370 | 143 |
| Naptalam | 47 | 50 |

[0315] The relationships among the treatments were examined using hierarchical cluster analysis based on the principal components from each data set (FIG. 16). For cluster analysis, the expression of each gene for each treatment was converted to a logarithmic scale and calculated as a standardized difference from control. A subset of genes that showed differential expression (p<0.01) in at least one treatment was extracted. The principal components of this subset of gene expression data were calculated and used to cluster the gene expression data (FIG. 16).

[0316] The resulting dendrogram of gene expression data shows characteristics of arbitrary clustering. Only isoxaflutole and chlorsulfuron grouped in an independent clade. The other compounds showed a stairstep pattern in the dendrogram, indicating very little overlap between regulated gene sets. Unknown 4 is separated from the remaining compounds, as expected based on the relatively large number of regulated genes following this treatment.

[0317] Because the clustering results indicate arbitrary clustering, the relationship of the unknown compounds to the commercial herbicides or to each other cannot be gleaned from these analyses. Although the majority of the genome was surveyed in these experiments, and the number of regulated genes in the treated samples is relatively high as compared to the number of significantly regulated metabolites, the same caveats relating to sample production for the metabolite analysis apply to this analysis as well, again illustrating the need for a way to combine and analyze all of the data available in one directly comparable data set.

[0318] Combined Data Cluster Analysis

[0319] In an attempt to identify relationships among the unknown compounds and commercial herbicides, data from all three technologies (gene expression analysis, metabolite analysis, and morphologic/phenotypic analysis) were used in combination for hierarchical cluster analysis. To give equal weighting to each data set, the principal components were used in the cluster analysis. The principal components for the metabolite data and gene expression data were derived as described above. The phenotypic data were coded as deviations from control. That is, the control value of any phenotypic measurement was set to 0, and positive numbers indicate phenotypes greater than control, while negative numbers indicate phenotypes less than control. The principal components of the phenotypic data were calculated for each treatment class.

[0320] Data from the unknown compounds and the five commercial herbicides for which gene expression analysis, metabolite analysis, and morphologic analysis data were available were used in this analysis. The principal components of the data for these 10 treatments were combined and a cluster analysis was performed on the combined dataset of 30 principal components. The results are shown in FIG. 17.
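Combining the three data types amounts to concatenating, for each treatment, the principal component scores from each platform and then measuring distances between the combined profiles. The Python sketch below illustrates only that bookkeeping. The scores are toy placeholders rather than data from these experiments, the Euclidean distance is an assumption, and the function names are hypothetical.

```python
import math


def combine(*blocks):
    """Concatenate per-treatment principal component scores.

    Each block maps treatment -> list of scores from one platform. Only
    treatments present in every block are kept, mirroring the restriction
    to treatments with all three data types.
    """
    shared = sorted(set.intersection(*(set(block) for block in blocks)))
    return {t: [x for block in blocks for x in block[t]] for t in shared}


def euclidean(a, b):
    """Euclidean distance between two equal-length score vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_matrix(profiles):
    """Pairwise distances, as input for a hierarchical clustering step."""
    names = list(profiles)
    return {(p, q): euclidean(profiles[p], profiles[q]) for p in names for q in names}


# Toy placeholder scores (NOT experimental data).
gene_pcs = {"A": [1.0, 0.0], "B": [0.0, 0.0]}
metabolite_pcs = {"A": [0.0], "B": [2.0], "C": [1.0]}
phenotype_pcs = {"A": [0.0], "B": [2.0]}

combined = combine(gene_pcs, metabolite_pcs, phenotype_pcs)
print(distance_matrix(combined)[("A", "B")])
```

Using the same number of components from each platform keeps any one data type from dominating the distance simply because it has more measured variables.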

[0321] The combined data cluster analysis produced more definitive results as compared to the gene expression data alone (i.e., the clusters were not random). However, the data set does not include herbicides with the same MOAs and therefore it is not possible to establish conclusive relationships based on the dendrogram. The inclusion of data from the additional commercial herbicides may help to clarify the relationships between the unknown compounds and the commercial compounds.

[0322] Fungal Nutritional Profiling Analysis

[0323] Minimally inhibitory concentrations were determined for each unknown compound using a two-fold dilution series in minimal media. In the nutritional experiments, M. grisea was only sensitive to Unknown 1 at the highest concentration tested. No other compounds inhibited growth; however, Unknown 5 was insoluble at the highest concentrations tested. Unknown 4 showed some growth inhibition at 250 μg/ml. Table 9 lists the concentrations used for nutritional profiling analysis for each compound.

TABLE 9 Test Concentrations for Nutritional Profiling

| Compound I.D. | Inhibitory Concentration | Sub-inhibitory Concentration |
|---|---|---|
| Unknown 1 | 500 μg/ml | 6.25 μg/ml |
| Unknown 2 | n/a | 500 μg/ml |
| Unknown 3 | n/a | 500 μg/ml |
| Unknown 4 | n/a | 250 μg/ml |
| Unknown 5 | n/a | 30 μg/ml |

[0324] Nutritional Profiling: Tier 1

[0325] Tier 1 includes minimal and supplemented media containing all test nutrients. The experiments were performed using the concentrations shown in Table 9. Unknown 1 was tested at both inhibitory and sub-inhibitory concentrations. The concentration of DMSO was normalized for all test compounds and the negative controls. Growth was monitored over seven days. Each treatment was performed in duplicate.

[0326] As expected, growth of M. grisea in the presence of Unknown 1 was inhibited in minimal media. No growth was observed in the supplemented media, indicating that growth in the presence of Unknown 1 could not be remediated in the presence of any of the nutrients tested. Unknown 2, Unknown 3, and Unknown 5 showed no growth defect in either medium, indicating that growth of M. grisea in the presence of these compounds was unaffected by addition of these nutrients. Growth of M. grisea in the presence of Unknown 4 was partially inhibited in minimal media and was remediated by the addition of supplements, indicating that one or more nutrients in the supplemented media abrogated the effect of Unknown 4 on growth.

[0327] Unknown 1 was also tested at a sub-inhibitory concentration. Only a slight inhibition of growth of M. grisea was observed in minimal and supplemented media, again indicating that the mode-of-action of Unknown 1 was unaffected by the addition of these nutrients.

[0328] Nutritional Profiling: Tier 2

[0329] The supplements tested in Tier 1 were subdivided into four groups, or sub-pools, consisting of amino acids, purines and pyrimidines, vitamins and cofactors subset 1, and vitamins and cofactors subset 2. Growth of M. grisea with and without Unknown 4 in each sub-pool, minimal and fully supplemented media was tested.

[0330] Growth of M. grisea in the presence of Unknown 4 was remediated in fully supplemented media, the amino acid sub-pool, and significantly remediated in the purine/pyrimidine sub-pool. Reduced growth was observed in the other media tested. Restoration of growth in both amino acid and purine/pyrimidine pools indicates that Unknown 4 may act on a central nutrient utilization pathway and not on a specific biosynthetic pathway. To examine this further, the amino acid sub-pool was further subdivided and tested.

[0331] Nutritional Profiling: Tier 3

[0332] The amino acid sub-pool from Tier 2 was subdivided into five further sub-pools including aromatic, sulfur containing, aliphatic/aliphatic hydroxy, basic+asn/pro, and acidic+gln amino acids. Growth of M. grisea with and without Unknown 4 in each sub-pool and minimal media was tested.

[0333] Growth of M. grisea in the presence of Unknown 4 was remediated in media containing aromatic amino acids, asp/glu/gln, and to a slightly lesser extent, basic+asn/pro amino acids. Growth on aliphatic/aliphatic hydroxy and sulfur amino acids was similar to or less than the levels of growth in minimal media in these experiments.

[0334] Again, restoration of growth in multiple amino acid pools indicates that Unknown 4 may act on a central nutrient utilization pathway and not on a specific biosynthetic pathway. In addition, in previous experiments, M. grisea was able to efficiently utilize aromatic, asp, glu, asn, pro, and basic amino acids as nitrogen sources. These results suggest that Unknown 4 may be negatively affecting nitrogen source utilization in M. grisea. A final tier of experiments was performed to address a potential nitrogen source utilization defect in the presence of Unknown 4.

[0335] Nutritional Profiling: Tier 4

[0336] Nitrogen source assimilation has been studied in several filamentous fungi. Typically, nitrate is converted to nitrite by nitrate reductase. Nitrite is converted to ammonia by nitrite reductase followed by assimilation into glutamine by glutamine synthetase. The amine group can then be used to generate glutamate from alpha-ketoglutarate. In Aspergillus nidulans, the regulation of nitrogen utilization has been studied extensively. When the preferred nitrogen sources, ammonia or glutamine, are present, nitrogen metabolite repression inhibits expression of genes required for utilization of other nitrogen sources such as nitrate, nitrite, and glutamate.

[0337] The effect of Unknown 4 on nitrogen source utilization was tested by providing various nitrogen sources. Growth of M. grisea with and without Unknown 4 in the presence of each of the nitrogen sources was tested.

[0338] Growth of M. grisea in the presence of Unknown 4 was recovered when ammonium or glutamine was used as a nitrogen source. Reduced growth was observed when nitrate or glutamate was used as a nitrogen source. Growth was inhibited completely in the presence of nitrite as the sole nitrogen source.

[0339] The fungal nutritional profiling results from Unknown 4 were compared to those for glyphosate at the same and a higher concentration (250 μg/ml and 1 mg/ml, respectively). The growth results with glyphosate at 250 μg/ml for Tiers 3 and 4 were nearly identical to those for Unknown 4. The growth results with glyphosate at 1.0 mg/ml were consistent with the MOA of glyphosate, a block in aromatic amino acid biosynthesis. Growth inhibition by glyphosate at this concentration was remediated by inclusion of aromatic amino acids in the media. Based on these results, it was determined that the MOA of Unknown 4 was inhibition of aromatic amino acid biosynthesis.

[0340] Validation Data for Isoxaflutole

[0341] The site-of-action of isoxaflutole is 4-hydroxyphenylpyruvate dioxygenase (HPPD, E.C. 1.13.11.27), which converts 4-hydroxyphenylpyruvate to homogentisate. Homogentisate is a precursor to α-tocopherols and plastoquinones. It is believed that carotenoid biosynthesis is indirectly inhibited by depletion of plastoquinones, a cofactor of phytoene desaturase, resulting in the bleaching phenotype observed with isoxaflutole. Tyrosine is an upstream precursor to homogentisate biosynthesis and, in some organisms including humans, phenylalanine can be converted to tyrosine via phenylalanine hydroxylase.

[0342] Examination of the metabolite data for isoxaflutole revealed that both tyrosine and phenylalanine were up-regulated relative to the control. Homogentisate was undetectable in all samples including the controls. Alpha-tocopherol was detected, but the levels were not significantly changed relative to the control at the early time points. The identification of increases in tyrosine and phenylalanine in the isoxaflutole data supports the use of metabolite data for analysis of herbicide site- or pathway-of-action. However, alterations in the expression of genes involved in the homogentisate biosynthetic pathway were not observed in these experiments. It is possible that the specific effects of isoxaflutole on this pathway do not perturb gene expression of this pathway specifically or at this early time point. Further analysis of gene expression at later time points is required.

[0343] Summary of the Analysis of Unknown 1

[0344] Phenotypic data from plants following Unknown 1 treatment suggest that the observed mode-of-action is similar to that of carotenoid biosynthesis inhibitors. Cluster analysis using the corresponding metabolite or gene expression data did not group this compound with the other bleaching herbicides (amitrole, isoxaflutole, and norflurazon), although the latter two clustered relatively close based on metabolite data at both the 20 minute and 1 hour time points. The fatty acid profile of Arabidopsis treated with Unknown 1 was altered. An increase in saturated and mono-unsaturated C18 fatty acids (Table 7) and linolenic acid was observed. An increase in linolenic acid was observed in several other treatments and may be related to a general stress response that results in the production of jasmonic acid. However, the increase in C18 fatty acids is unique to Unknown 1 and treatment of plants with any C18 fatty acid has been shown to induce cell death.

[0345] In the fungal nutritional profiling platform, Unknown 1 was able to completely inhibit growth of M. grisea in minimal and supplemented media. In addition, no growth defect was observed in minimal, supplemented, or minimal plus tyrosine as sole nitrogen source at a sub-inhibitory concentration. When treated with isoxaflutole at concentrations insufficient to inhibit growth, M. grisea growth was inhibited in minimal plus tyrosine media, while growth in minimal media was unaffected. Since these results differ from those obtained with Unknown 1, the target of isoxaflutole (HPPD) is not likely the same as the target of Unknown 1.

[0346] Summary of the Analysis of Unknown 4

[0347] The fungal nutritional profiling results obtained from Unknown 4 at the partially inhibitory concentration (250 μg/ml) were nearly identical to the growth characteristics observed with glyphosate at the same concentration (partially inhibitory) in the various media tested in Tiers 3 and 4. Based on these results, it was determined that the mode-of-action of Unknown 4 was inhibition of aromatic amino acid biosynthesis. However, the results for both Unknown 4 and glyphosate suggest that they also affect nitrogen utilization. Both inhibited growth of M. grisea when nitrate, nitrite, or glutamate was provided as the sole nitrogen source. Little growth defect was observed when ammonium or glutamine was provided. The results differ from those for glufosinate, whose site-of-action is glutamine synthetase. In the presence of glufosinate, M. grisea is only able to utilize glutamine and glutamate as nitrogen sources.

[0348] Thus, it is hypothesized that Unknown 4 may also affect nitrogen utilization and/or metabolism in Arabidopsis. In plants, nitrogen regulation is very complex and is closely associated with carbon utilization. However, studies of nitrate addition to N-starved Arabidopsis plants have identified several nitrate-regulated genes. If Unknown 4 inhibits nitrogen utilization, addition of this compound to Arabidopsis may have the opposite effect on these genes. Table 10 lists a subset of these genes and their relative expression levels following treatment with Unknown 4.

TABLE 10
Expression of Nitrate Regulated Genes Following Unknown 4 Treatment

Gene                        Unknown 4    Nitrate Regulation
Phosphate transporter       ↓            ↑
Transaldolase               ↓            ↑
Transketolase               ↓            ↑
Malate Dehydrogenase        ↓            ↑
MYB transcription factor    ↓            ↑
Nitrate transporter         ↓            ↑
Glutamine synthetase(2)     ↓            ↑
Glutamate synthetase        ↓            ↑
MADs Box(2)                 ↑            ↓
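The comparison summarized in Table 10 amounts to checking, gene by gene, whether the response to Unknown 4 runs opposite to the response to nitrate. The following Python sketch illustrates that check. The dictionary layout and the function name `opposite_direction_genes` are assumptions introduced for illustration only; the directions are transcribed from Table 10.

```python
# Directions transcribed from Table 10 as
# (response to Unknown 4, response to nitrate).
# "down" stands for a down arrow and "up" for an up arrow.
TABLE_10 = {
    "Phosphate transporter": ("down", "up"),
    "Transaldolase": ("down", "up"),
    "Transketolase": ("down", "up"),
    "Malate Dehydrogenase": ("down", "up"),
    "MYB transcription factor": ("down", "up"),
    "Nitrate transporter": ("down", "up"),
    "Glutamine synthetase(2)": ("down", "up"),
    "Glutamate synthetase": ("down", "up"),
    "MADs Box(2)": ("up", "down"),
}


def opposite_direction_genes(table):
    """Return the genes whose Unknown 4 response opposes their nitrate response."""
    return [gene for gene, (unknown4, nitrate) in table.items()
            if unknown4 != nitrate]


opposite = opposite_direction_genes(TABLE_10)
print(f"{len(opposite)} of {len(TABLE_10)} genes respond in the opposite direction")
```

Every gene listed in Table 10 passes this check, which is the pattern expected if Unknown 4 inhibits nitrogen utilization.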

[0349] Both fungal growth and gene expression data support the hypothesis that Unknown 4 alters nitrogen source metabolism in both organisms. Fungal growth data helped guide the analysis of both gene expression and metabolite data, although no specific conclusions were made from the metabolite data at the recorded time points. Since plants treated with Unknown 4 exhibited perturbation of a large number of genes, an internal database was searched to identify whether any of the genes altered by Unknown 4 treatment had been found to be essential for plant growth and development. The search revealed that a total of 86 genes altered by Unknown 4 treatment had been found to be essential for plant growth and development (Table 11). A subset of these 86 genes includes five genes believed to participate in nitrogen metabolism, which lends further credibility to the conclusions derived from the fungal nutritional profiling platform discussed herein.

TABLE 11
Genes Altered by Various Herbicides and Identified as Essential Genes

Compound         No. of Lethal Genes Altered
Unknown 1        8
Unknown 2        7
Unknown 3        6
Unknown 4        86
Unknown 5        12
Asulam           5
Chlorsulfuron    8
Glufosinate      10
Isoxaflutole     15
Naptalam         3

[0350] Phenotypic, biochemical, and gene expression data were gathered to determine the effects of five unknown herbicide candidates and up to 18 commercial herbicides in Arabidopsis after brief treatments with a high dose of each compound. Fungal nutritional profiling was employed as a surrogate biological system to examine the effects of nutrient utilization in M. grisea in the presence of each compound.

[0351] From the data collected, an example was obtained in which metabolites upstream from the site-of-action were accumulating after 1 hour (isoxaflutole). It was also shown that by using results from fungal nutritional profiling, a hypothetical mode-of-action of Unknown 4 in M. grisea was posited and supported by gene expression data from Arabidopsis.

[0352] As described in Specific Example 2, site-of-action experimental data were collected from samples taken at relatively late time points as compared to the presently described study. Sample collections were calibrated to each herbicide based on 10%, 30%, and 70% of the time required for full symptom development. For example, the 10% and 70% sampling points for the fast acting herbicide, paraquat, were 5 and 48 hours, respectively, while the analogous time points for the slow acting herbicide, chlorsulfuron, were 24 and 168 hours, respectively. Although clustering of the herbicides based on gene expression and metabolite data was more accurate using these time points, the identification of site- or pathway-of-action was not achieved.
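The calibrated sampling scheme described above can be sketched as a small calculation. In the following Python sketch, the function name `sampling_times` is hypothetical, and the 240 hour time to full symptom development for chlorsulfuron is not stated in the text; it is inferred here from the reported 24 hour (10%) and 168 hour (70%) sampling points.

```python
def sampling_times(full_symptom_hours, fractions=(0.10, 0.30, 0.70)):
    """Return sampling times (hours) at fixed fractions of the time
    required for full symptom development."""
    return [round(fraction * full_symptom_hours, 1) for fraction in fractions]


# Assumption: 240 hours to full symptom development for chlorsulfuron,
# inferred from the reported 10% and 70% sampling points.
print(sampling_times(240))
```

Under that assumption the sketch reproduces the 24 and 168 hour points given for chlorsulfuron and places the 30% point at 72 hours.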

[0353] In the experiments performed in the present study, the time points for sampling were fixed at 20 minutes and 1 hour following treatment, without accounting for the kinetics of action. Gene expression and metabolite data from these early time points did not cluster as expected based on known modes of action (MOAs). It is well known that the time to response varies due to many factors, such as uptake and transport, even for compounds that target the same site. Thus, at fixed time points as used in the present study, the genes and metabolites specifically perturbed by each compound or MOA class may not be fully expressed or expressed to the same levels.

[0354] Although the sampling time points used in the experiments presented herein may not have been ideal, informative data were obtained. Metabolites upstream of the SOA of isoxaflutole (tyrosine and phenylalanine) began to accumulate relative to the control after one hour. In addition, the metabolites downstream from the SOA of glyphosate (tyrosine) decreased relative to the control after one hour. A group of stress related metabolites was observed to increase after one hour in 12 of 23 herbicides tested including Unknown 1, Unknown 2, and Unknown 4, suggesting that the kinetics of action of these herbicides were rapid. Three unknown metabolites (pLCcpnd9, 78, and 310) were also observed to increase in eight treatments after 20 minutes and thus, they may represent early stress markers.

[0355] Based on the results as described herein, it is hypothesized that experiments performed with intermediate time points which are calibrated to each herbicide may help more accurately identify the point at which clustering begins to occur (i.e. later than or equal to the time points used in the present study, but earlier than the time points used in the previous study). With the addition of initial clustering data, the data sets may be enriched for specific metabolites and gene expression responses that can be used to identify the site- or pathway-of-action. This can be tested using commercial herbicides with known MOAs.

[0356] The following is an example of an approach to optimizing and implementing an experimental design to increase the value of the described MOA analysis platform.

[0357] Define the kinetics-of-action. Several herbicides had very little effect on metabolite regulation at either time point tested (Table 7). This suggests that the herbicide may not have reached its target within the timeframe of sampling. Cell leakage assays could be used to identify the point at which herbicidal action results in cell damage prior to the production of a visible phenotype. The onset of the visible phenotype can also be used as a landmark. Sampling times could be chosen to bracket these time points.

[0358] Add additional time points. Increasing the number of time points for each herbicide and bracketing relative to a kinetics-of-action would allow for trend analysis over time, thereby enhancing the ability to interpret metabolite and gene expression data. Additional time points will not require much more of each test compound with the present treatment procedure. At a rate equivalent to 1.0 kg/ha, only 0.85 mg of herbicide was required per time point. Thus, 10 mg of a test compound can provide several more time points than were generated for this study.
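The compound budget described above can be checked with simple arithmetic. The following Python sketch uses only the 0.85 mg per time point and 10 mg figures from the text; the function name `max_time_points` is hypothetical, and the sketch assumes one treatment per time point with no handling losses.

```python
import math

MG_PER_TIME_POINT = 0.85  # herbicide per time point at 1.0 kg/ha, from the text


def max_time_points(compound_mg, mg_per_time_point=MG_PER_TIME_POINT):
    """Return the number of whole time points a given mass of compound supports.

    Assumes one treatment per time point and no handling losses.
    """
    return math.floor(compound_mg / mg_per_time_point)


print(max_time_points(10.0))
```

Under these assumptions, 10 mg supports 11 time points, compared with the two time points (20 minutes and 1 hour) sampled in the present study.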

[0359] Collect data for herbicides with known modes/sites-of-action. Data from herbicides with known modes/sites-of-action will help validate the experimental design, enhance comparative approaches for analysis of new herbicides, and assist in the identification of herbicide candidates with novel modes-of-action. In addition, these commercial herbicides can be used to determine the most appropriate sampling points for various site-of-action classes. Proper clustering of commercial herbicides with known sites-of-action will validate particular sampling regimes.

[0360] Reanalyze metabolite data as new standards are run for peak identification. An ongoing standards program for identifying metabolites seen in biochemical profiling data could result in the identification of previously unrevealed and/or unidentified metabolites. Resolution within and between pathways will be enhanced as new metabolites are accurately identified. Advantageously, data already generated can be reanalyzed as new peaks are identified, thereby eliminating the need to repeat experiments.

[0361] Perform gene expression analysis on the same samples generated for metabolite analysis. Biological samples or total RNA can be delivered for gene expression analysis. Gene expression analysis complements metabolite analysis by providing a link between metabolite changes and gene expression changes. Previous reports have demonstrated that greater degrees of clarity can be achieved using multiple data streams for cluster analysis. With a proper sampling regime, gene expression analysis should also provide valuable data for identifying perturbed genes/pathways. Combined with the metabolite data, a higher resolution picture can emerge.

[0362] Continue using fungal nutritional profiling. Based on the analysis of commercial herbicides, a positive result can identify the target pathway and may even identify the site-of-action in some cases. Additionally, the compound requirements are very small. Only 1.0 mg of herbicide was required for the extended fungal nutritional profiling experiments described for Unknown 4.

[0363] The above-described specific example illustrates the value of combining different types of data to obtain a more complete representation of a biological system. In this specific example, the combination of gene expression data, metabolite data, and phenotypic data into coherent data allowed experimental conclusions to be drawn that would not likely have been drawn from a collective review of gene expression data, metabolite data, and phenotypic data analyzed separately. Adding a fourth data source, that is nutritional profiling, only serves to increase the information available for drawing biologically relevant conclusions, the results of which were used to guide the analysis of the gene expression and metabolite data. Additionally, populating the experimental data sets with data from “known” samples to use as controls gives valuable guidance when looking at the large, combined, complex data sets.

[0364] The methods of the present invention provide ways to achieve creation of coherent data sets from data such as that set forth in the above specific example. A coherent data set is not necessarily a closed system, and can accommodate the addition of new data as it becomes available. The above-described optimization process is an example of how the specific example could be modified to strengthen its value as a model for herbicide site- or pathway-of-action studies. The SOAR (Specific Example 2) and MOA1 (Specific Example 3) studies outlined herein create the foundation for a comprehensive herbicide site-, mode-, and pathway-of-action coherent data set.

[0365] The results of the foregoing study, MOA1, show that it is possible to accurately predict the MOA of herbicides using a combination of technologies when the MOA is represented in an existing database. The strategy set forth herein, of standardizing and combining disparate data into coherent data sets for the analysis of biological samples, will increase the predictive power of the analysis. The strategy is applicable to any experimental system and any data or technology, including alternatives not explored herein, such as protein expression and activity profiling.

SPECIFIC EXAMPLE 4 Preparation of Cell Culture Samples for Analysis

[0366] Cell culture samples were either freeze-dried or fresh-frozen at −80° C. Cell culture samples were prepared for gene expression and LC-MS analysis as described in the above examples for plant samples. For GC-MS analysis, the lyophilized sample material was extracted and derivatized in 96-well plates. The procedure yielded trimethylsilyl (TMS) derivatives for a variety of compounds including organic acids, fatty acids, amino acids, sugars, alcohols, and sterols. The basic derivatization procedure involved a two-step derivatization using MSTFA (methyl trimethylsilyl trifluoroacetamide) in acetonitrile, acidified with trifluoroacetic acid, followed by derivatization with a strongly basic silylating agent such as TMSDMA (trimethylsilyldimethylamine).

SPECIFIC EXAMPLE 5 Yeast Azole Drug Experiment

[0367] Ergosterol is an essential component of fungal plasma membranes. It affects membrane permeability and the activities of membrane-bound enzymes. This sterol is a major component of secretory vesicles and has an important role in mitochondrial respiration and oxidative phosphorylation. G. Daum et al., 14 YEAST 1471-1510 (1998). It can thus be expected that changes in ergosterol levels and sterol structure influence the activities of several metabolic pathways. Enzymes in the ergosterol biosynthetic pathway are the targets of a number of anti-fungal agents. Over the past 40 years, amphoteracin B synthesized by Streptomyces nodosus has been the mainstay of antifungal therapy for severe systemic mycotic infections. F. C. Odds, Antifungal Therapy, in PRINCIPLES AND PRACTICE OF CLINICAL MYCOLOGY 35-48 (C. C. Kibbler et al. eds., 1996); H. J. Vanden Bossche et al., Discovery, Chemistry, Mode of Action, and Selectivity of Itraconazole, in CUTANEOUS ANTIFUNGAL AGENTS 263-283 (J. W. Rippon & R. A. Fromtling eds., 1993).

[0368] Amphoteracin B is capable of binding irreversibly to ergosterol in the fungal cytoplasmic membrane, thus increasing membrane permeability and ultimately causing fungal cell death. Despite its proven efficacy, use of the conventional formulation of amphoteracin B (amphoteracin B deoxycholate) is limited by potentially severe adverse reactions, especially nephrotoxicity and infusion-related events. Over the past 20 years, azoles, primarily ketoconazole and fluconazole, which are less toxic alternatives to amphoteracin B, have become attractive. The anti-fungal activities of azole derivatives arise from a complex multimechanistic process initiated by the inhibition of two cytochromes P450 involved in the biosynthesis of ergosterol, namely, the P450 that catalyzes the 14-demethylation of lanosterol or eburicol (encoded by erg11), and 22-desaturase (encoded by erg5). D. C. Lamb et al., 43 ANTIMICROB. AGENTS CHEMOTHER. 1725-1728 (1999).

[0369] However, there are problems with current azoles, namely, their relatively poor efficacy against invasive mold infections and concern about emerging clinical and microbiologic resistance to azoles. Due to the increasing prevalence of disseminated fungal infections associated with the acquired immune deficiency syndrome (AIDS) epidemic, increased utilization of organ transplantation and immunosuppression, and the increased number of invasive fungal nosocomial infections, antifungal agents are more widely used than ever before. Consequently, there is a need for alternative drugs that are both efficacious and well tolerated. Posaconazole is a triazole that is structurally related to itraconazole. It is currently in Phase III trials by Schering-Plough Corporation. Compared to two early azole drugs, posaconazole is a significantly more potent inhibitor of sterol C14 demethylation, particularly in Cryptococcus neoformans and Aspergillus spp. K. L. Oakley et al., 41 ANTIMICROB. AGENTS CHEMOTHER. 1124-1126 (1997).

[0370] The rapid development of genomics in the past several years provided unique access to genes and regulatory elements of individual genes at the genome level. Successful application of the genomic techniques, such as DNA microarrays for exploring transcriptional profiles and genome differences for a variety of microorganisms, has greatly facilitated an understanding of mode of action of various anti-fungal drugs. M. D. De Backer et al., 45 ANTIMICROB. AGENTS CHEMOTHER. 1660-1670 (2001); M. H. Jia et al., 3 PHYSIOL. GENOMICS 83-92 (2000). However, microarrays might not provide direct information about how the mRNA change is coupled to the change in biological functions, because the rate of enzymatic reactions is a function of substrates and products (metabolomes). O. Fiehn, 48 PLANT MOL. BIOL. 155-171 (2002); B. H. Ter Kuile & H. V. Westerhoff, 500 FEBS LETT. 169-171 (2001).

[0371] Moreover, for most organisms, there is no direct relationship between metabolites and genes in the way that there is for mRNA and proteins. For example, S. cerevisiae has fewer than 600 low-molecular-weight metabolite intermediates and has approximately 6200 protein-encoding genes. Metabolomics, as a method to define the small molecule diversity in a cell and to display the differences in small molecule abundance, exhibits many advantages in terms of metabolic analyses. As functional entities within cells, metabolites vary in concentration as a consequence of genetic and/or physiological changes. Profiling of up to 68 primary metabolites has been successfully demonstrated to be useful for clinical research by differentially comparing healthy human tissues with diseased ones. J. M. Halket et al., 13 RAPID COMMUN. MASS SPECTROM. 279-284 (1999). A similar approach has been taken in plant research, wherein mass spectrometry has been applied to profile a limited number of primary metabolites. M. A. Adams et al., 266 ANAL. BIOCHEM. 77-84 (1999).

[0372] Metabolomics study is an important part of an integrative approach for accessing cellular metabolism and understanding mode of action of drugs. In the present specific example, the methods of the invention are applied to an integrated genomic and metabolomic approach to reveal the mode of action of antifungal drugs. Using S. cerevisiae as a model system, the global metabolic consequences caused by treatment with four antifungal drugs (amphoteracin B, ketoconazole, fluconazole, and posaconazole) were examined at both the transcriptome (RNA) and metabolome (small molecule) levels. The integrative analyses presented a global view of the metabolic changes associated with each drug treatment, thus allowing for a better interpretation of the mode of action of antifungal drugs.

[0373] Materials and Methods

[0374] Strains and Media

[0375] Saccharomyces cerevisiae wild type strain BY4743 was purchased from American Type Culture Collection (ATCC, Manassas, Va.). The yeast strain was grown in YPD or SD media. H. Ito et al., 153 J. BACTERIOL. 163-168 (1983). The cultures started from fresh single colonies were grown in 1.0 ml YPD overnight at 30° C. (the OD₆₀₀ values of overnight cultures are normally around 2.0 to 3.0 after 16 hours of growth). The OD₆₀₀ was adjusted to 1.0 with YPD media, then 2.0 ml of each was inoculated into three 250 ml flasks, each containing 50 ml of SD media. When the OD₆₀₀ reached 2.0, an amount equivalent to 2X MIC (minimal inhibitory concentration) of each of the four tested antifungal drugs was dissolved in 0.5 ml dimethyl sulphoxide (DMSO) and added into the culture. The cells were kept growing for another two hours, then collected by centrifugation at 4000 rpm for 5 minutes at 4° C. Pellets were washed once with ice-cold water, then were lyophilized overnight at 4° C.

[0376] Determination of MIC

[0377] Antifungal drugs amphoteracin B, ketoconazole, and fluconazole were purchased from Sigma (Sigma Chemical Co., St. Louis, Mo.), and posaconazole was a gift from Duke Medical Center (Duke Univ. Medical Center, Durham, N.C.). Minimal inhibitory concentration was determined using 96-well plates. 100 μl of the overnight culture was added to fresh YPD media in a new sterile tube. The new tube was returned to the 37° C. shaker and incubated for 4 hours. The cells were spun down in the microcentrifuge and washed twice with sterile dH₂O. The cells were diluted into YPD media and loaded into 96-well plates. The tested antifungal drug was dissolved in DMSO and added into plates at the final DMSO concentration of 1.0%.

[0378] RNA Extraction and Microarray Preparation

[0379] Approx. 18±1 mg of lyophilized yeast cells in a 1.5 ml microcentrifuge tube were rehydrated in 75 μl RNA LATER (Ambion, Inc., Austin, Tex.) and incubated for 30 minutes. 875 μl TRIZOL Reagent (GibcoBRL, Rockville, Md.) were added to each tube. The tubes were vortexed for 15 seconds and allowed to rest for 45 seconds, and this cycle was repeated for a total of 5 minutes. 240 μl 100% chloroform (RNase-free) was added to each tube. Tubes were vortexed for 30 seconds, then incubated for 10 minutes at room temperature (RT). The tubes were then spun at 14,000 rpm in a refrigerated Eppendorf centrifuge at 4° C. for 5 minutes. 570 μl of the aqueous phase was removed and placed in a new, RNase-free 2.0 ml tube. 430 μl nuclease-free water (Ambion, Inc., Austin, Tex.) and 1.0 ml 100% isopropanol were added to each tube and mixed thoroughly by inversion. Tubes were incubated for 10 minutes at RT. Samples were centrifuged for 20 minutes as before. Pellets were washed with 400 μl 70% ethanol and centrifuged for 10 minutes as before. The pellet was then dissolved in 100 μl nuclease-free water. RNA quality was determined using the Bioanalyzer 2100 and the RNA 6000 assay (Agilent Technologies, Palo Alto, Calif.) according to manufacturer's instructions. RNA concentrations were determined spectrophotometrically by measuring the absorption at 260 nm in an Ultrospec 2000 (Pharmacia Biotech, Piscataway, N.J.). Microarrays containing approximately 6200 S. cerevisiae genes, essentially covering the entire genome, were generated by Agilent Technologies using oligonucleotides 60 bases in length synthesized in situ by an ink-jet printing method (Agilent Technologies, Palo Alto, Calif.).
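The spectrophotometric determination of RNA concentration at 260 nm can be illustrated with a short calculation. The following Python sketch relies on the widely used approximation that an absorbance of 1.0 at 260 nm in a 1 cm path corresponds to about 40 μg/ml of RNA. The text does not state the conversion factor or dilution actually used, so the function name, the absorbance reading, and the dilution factor below are hypothetical.

```python
RNA_UG_PER_ML_PER_A260 = 40.0  # common approximation for RNA, 1 cm path length


def rna_concentration_ug_per_ml(a260, dilution_factor=1.0, path_length_cm=1.0):
    """Estimate RNA concentration (ug/ml) from absorbance at 260 nm."""
    return a260 * RNA_UG_PER_ML_PER_A260 * dilution_factor / path_length_cm


# Hypothetical reading: A260 of 0.5 measured on a 1:50 dilution.
concentration = rna_concentration_ug_per_ml(0.5, dilution_factor=50)

# Total yield in the 100 ul (0.1 ml) resuspension volume given in the text.
total_ug = concentration * 0.1
print(concentration, total_ug)
```

The same relationship would apply to the labeled cRNA measurements, subject to the same caveat about the conversion factor.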

[0380] Microarray Hybridizations

[0381] RNA samples were labeled with either Cy3 or Cy5 using Agilent's Fluorescent Linear Amplification Kit according to the manufacturer's instructions (Agilent Technologies, Palo Alto, Calif.). Labeled cRNAs were evaluated using the RNA 6000 assay on the Agilent Bioanalyzer 2100. Labeled cRNA concentrations were determined spectrophotometrically by measuring the absorption at 260 nm in an Ultrospec 2000 (Pharmacia Biotech, Piscataway, N.J.). Probe solutions containing 125 ng of labeled cRNA for each mutant and its paired control were prepared using Agilent's in situ Hybridization Reagent Kit (Agilent Technologies, Palo Alto, Calif.). Each pair of samples to be hybridized was independently labeled and hybridized utilizing a fluor reversal for a total of two hybridizations per sample pair. The microarrays were scanned simultaneously in the Cy3 and Cy5 channels with Agilent's 48-slide, Dual Laser DNA Microarray Scanner (Agilent Technologies, Palo Alto, Calif.) at 10 μm resolution using default settings.

[0382] Microarray Data Processing and Analyses

[0383] Image Analysis Software (Version A.4.0.45, Agilent Technologies, Palo Alto, Calif.) was used for image analysis. Each feature was determined from an array's associated pattern file and a detection algorithm. Intensity values for each feature were determined after subtracting background derived from an average of negative control features. Features with unusual pixel intensity statistics (e.g., high non-uniformity, saturation in either channel, and the like) were excluded from downstream analyses. Data was loaded into the Rosetta RESOLVER database (Rosetta Inpharmatics Inc., Kirkland, Wash.) for storage and analysis. Data was evaluated after combining results from fluor reversal replicate hybridizations. The annotation of yeast ORFs was obtained from Proteome BIOKNOWLEDGE Library (Incyte Genomics, Palo Alto, Calif.).
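One simple way to combine fluor reversal replicates is to average the treated-to-control log ratio from the two hybridizations, which cancels a dye-specific bias that multiplies one channel. The following Python sketch illustrates that idea only. It is not the error-weighted procedure of the Rosetta RESOLVER system, whose internals are not described here; the function names and intensity values are hypothetical, and the inputs are assumed to be background-subtracted.

```python
import math


def log_ratio(treated, control):
    """Log10 ratio of treated to control background-subtracted intensity."""
    return math.log10(treated / control)


def combine_fluor_reversal(forward, reverse):
    """Average the treated/control log ratios from a dye-swap pair.

    forward: (cy5, cy3) intensities with the treated sample labeled with Cy5.
    reverse: (cy5, cy3) intensities with the treated sample labeled with Cy3.
    """
    fwd = log_ratio(treated=forward[0], control=forward[1])
    rev = log_ratio(treated=reverse[1], control=reverse[0])
    return (fwd + rev) / 2.0


# Hypothetical feature: a true 2-fold induction, with Cy5 reading 10% high.
combined = combine_fluor_reversal(forward=(2200.0, 1000.0),
                                  reverse=(1100.0, 2000.0))
print(round(10 ** combined, 3))
```

In this hypothetical case the forward hybridization alone overestimates the induction and the reverse hybridization alone underestimates it, while the averaged log ratio recovers the 2-fold change.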

[0384] GC-MS Derivatization and Analyses

[0385] Approximately 10 mg of dried ground cells were extracted in solvent, converted to trimethylsilyl derivatives in-situ, and analyzed by gas chromatography with time of flight mass spectrometry (GC/TOF-MS) as described previously. Separations were conducted using a 50% phenyl-50% methyl stationary phase, helium carrier gas, and a programmed oven temperature that ramped from a starting temperature of 50° C. to a final temperature of over 300° C. Compounds detected by GC-MS with an electron impact (EI) ion source were cataloged based on Kovats retention indices and mass-to-charge ratio (m/z) of the ions characteristic of each peak. Commercially available reference compounds were obtained from Sigma-Aldrich (Sigma Chemical Co., St. Louis, Mo.) or VWR (VWR Scientific Products, Baltimore, Md.). Table 12 provides a list of detected compounds.
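Retention indices place each peak on a scale defined by the n-alkanes that elute before and after it. The text does not state which form of the index was computed. For temperature-programmed separations, the linear form I = 100 × [n + (t_x − t_n)/(t_(n+1) − t_n)] is commonly used, and the following Python sketch assumes that form. The function name and the alkane retention times are hypothetical.

```python
def linear_retention_index(t_x, alkanes):
    """Compute a linear retention index for a peak eluting at time t_x.

    alkanes: list of (carbon_number, retention_time) pairs for n-alkane
    standards, sorted by increasing retention time.
    """
    for (n, t_n), (n_next, t_next) in zip(alkanes, alkanes[1:]):
        if t_n <= t_x <= t_next:
            # Interpolate linearly between the two bracketing alkanes.
            return 100.0 * (n + (n_next - n) * (t_x - t_n) / (t_next - t_n))
    raise ValueError("retention time is outside the range of the alkane standards")


# Hypothetical alkane ladder: decane, undecane, dodecane (times in minutes).
ladder = [(10, 5.0), (11, 6.0), (12, 7.5)]
print(linear_retention_index(6.75, ladder))
```

A peak eluting midway between undecane and dodecane in this hypothetical ladder receives an index of 1150, which can then be paired with characteristic m/z values when cataloging compounds.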

[0386] LC-MS Procedures

[0387] Approximately 10 mg of dried ground cells were extracted in 0.5 ml 10% aqueous methanol containing labeled internal standards. Tissue was disrupted by a 30 second pulse of high-level sonic energy (lithotripsy), at a maximum temperature of 30° C. The extract was centrifuged at 4000 rpm for 2 minutes. The supernatant, diluted with an equal volume of 50% aqueous acetonitrile (V/V), was chromatographed on C18 HPLC in an acetonitrile/water gradient containing 5 mM ammonium acetate. Samples were passed through a splitter and the split flow was infused to the turbo-ionspray ionization sources of two Mariner LC TOF mass spectrometers (PerSeptive Biosystems Inc., Framingham, Mass.). Ion sources were optimized to generate and monitor positive (pLC) and negative (nLC) ions, respectively. The Total Ion Chromatogram (TIC) was analyzed for compounds with masses ranging from 80 to 900 Da. Individual ion traces were used for both calibration and quantification. Relative amounts of compounds were determined using intensity and peak areas of individual ion traces. Isotopically labeled internal standards were used for peak area ratios, response factor determination, and normalization of data throughout the experiment. Table 12 provides a list of detected compounds.
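Normalization against an internal standard can be illustrated as a peak area ratio. The following Python sketch is a minimal illustration under stated assumptions: the function names and peak areas are hypothetical, and the response factor determination mentioned above is not modeled.

```python
def area_ratio(analyte_area, internal_standard_area):
    """Normalize an analyte peak area to the internal standard peak area."""
    if internal_standard_area <= 0:
        raise ValueError("internal standard peak area must be positive")
    return analyte_area / internal_standard_area


def mean_area_ratio(replicates):
    """Average the normalized ratio over (analyte_area, standard_area) pairs."""
    ratios = [area_ratio(analyte, standard) for analyte, standard in replicates]
    return sum(ratios) / len(ratios)


# Hypothetical peak areas for one compound in treated and control extracts.
treated = mean_area_ratio([(1500.0, 3000.0), (1800.0, 3000.0)])
control = mean_area_ratio([(900.0, 3000.0), (1200.0, 4000.0)])
print(treated, control)
```

Dividing by the internal standard area in each run removes run-to-run variation in injection and ionization before treated and control samples are compared.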
TABLE 12 Detected Metabolites Treatment Compound Platformp-Value Fold Change Amphoteracin B 2-ketobutyric acid nLC 0.225631474−0.999150382 Amphoteracin B 2-ketoglutaric nLC 0.622408732 8.790891018Amphoteracin B 3-indolylacetonitri nLC 0.197015297 −0.999782451Amphoteracin B 4ambutyrate/dimglyc pLC 0.920009792 0.01278731Amphoteracin B 4-aminobenzoic acid pLC 1 0 Amphoteracin B4aminobutyrate/dimg nLC 0.820379809 −0.198261949 Amphoteracin B4-methylcatechol nLC 1 0 Amphoteracin B 4-methylcatechol PLC 1 0Amphoteracin B 5hydroxyLtryptophan nLC 1 0 Amphoteracin B5hydroxyLtryptophan pLC 1 0 Amphoteracin B 6benzylaminopurine PLC 1 0Amphoteracin B 6-benzylaminopurine nLC 1 0 Amphoteracin B abscisic acidnLC 1 0 Amphoteracin B abscisic acid LC 1 0 Amphoteracin B aconitic acidnLC 0.891314692 0.542608239 Amphoteracin B adenine nLC 1 0 AmphoteracinB adenine pLC 0.892253251 0.115293431 Amphoteracin B adenosine nLC 1 0Amphoteracin B adenosine LC 1 0 Amphoteracin B alanine GC 0.054639399−0.777259086 Amphoteracin B alanine nLC 0.62524207 0.159889332Amphoteracin B alanine/sarcosine pLC 0.540255177 0.260223791Amphoteracin B allantoic acid nLC 0.777964345 −0.145621023 AmphoteracinB allantoic acid pLC 1 0 Amphoteracin B allantoin nLC 0.1491693523.969743665 Amphoteracin B anthranilic acid nLC 1 0 Amphoteracin Banthranilic acid pLC 1 0 Amphoteracin B arginine nLC 0.315413423−0.48852387 Amphoteracin B arginine pLC 0.522893347 0.466194768Amphoteracin B argininosuccinate nLC 1 0 Amphoteracin Bargininosuccinate LC 1 0 Amphoteracin B asparagine GC 6.41E-06−0.999990003 Amphoteracin B asparagine nLC 0.758489047 0.151122053Amphoteracin B asparagine pLC 0.526485859 0.634356489 Amphoteracin Baspartic nLC 0.703732114 0.240696517 Amphoteracin B aspartic acid GC0.024172801 −0.974333333 Amphoteracin B aspartic acid pLC 0.6260012570.359387817 Amphoteracin B benzoic acid nLC 1 0 Amphoteracin B biotinnLC 0.363209057 1.077063265 Amphoteracin B biotin pLC 1 0 Amphoteracin Bcaffeic acid nLC 0.427037943 
−0.58815132 Amphoteracin B caffeine LC 1 0Amphoteracin B campesterol GC 1 0 Amphoteracin B catechol nLC 1 0Amphoteracin B cinnamic acid nLC 1 0 Amphoteracin B citric acid TME pLC1 0 Amphoteracin B citricanoic/itaconi nLC 1 0 Amphoteracin B citrullinenLC 0.217809679 1659.333333 Amphoteracin B citrulline pLC 0.4333662831.798113764 Amphoteracin B coumaric acid nLC 1 0 Amphoteracin B cytidinenLC 0.000680198 −0.998602701 Amphoteracin B cytidine pLC 1 0Amphoteracin B cytosine nLC 1 0 Amphoteracin B cytosine pLC 1 0Amphoteracin B decanoic acid nLC 0.824169685 −0.087161599 Amphoteracin Bdesmosterol GC 1 0 Amphoteracin B diaminopimelic acid nLC 1 0Amphoteracin B diaminopimelic acid pLC 1 0 Amphoteracin B dihydrofolicacid nLC 1 0 Amphoteracin B dihydrofolic acid pLC 1 0 Amphoteracin Bdipicolinic acid pLC 1 0 Amphoteracin B disaccaride1 GC 7.54E-06−0.99999 Amphoteracin B disaccaride2 GC 0.000388379 −0.997666667Amphoteracin B disaccaride3 GC 0.000700744 −0.997666667 Amphoteracin BDLaminoadipic acid nLC 0.40985594 1.410752688 Amphoteracin BDL-aminoadipic acid pLC 0.229215472 0.470273881 Amphoteracin Bergosterol GC 0.114118055 2.303333333 Amphoteracin B estrone nLC 1 0Amphoteracin B farnesol nLC 1 0 Amphoteracin B folic acid nLC 1 0Amphoteracin B folic acid pLC 1 0 Amphoteracin B fucosterol GC0.186711806 −0.655333333 Amphoteracin B fumaric/3m2oxobutan nLC0.238894937 0.442276246 Amphoteracin B gallic acid nLC 0.299606440.162157188 Amphoteracin B gibberellic nLC 1 0 Amphoteracin Bglucosamine pLC 1 0 Amphoteracin B glucosamine6PO4 nLC 0.273438701−0.995114007 Amphoteracin B glucosamine6PO4 pLC 1 0 Amphoteracin Bglutamate pLC 0.563874982 0.43414851 Amphoteracin B glutamic/acetylserinLC 0.733962176 0.141411563 Amphoteracin B glutamine GC 0.019324613−0.911637212 Amphoteracin B giutamine/lysine nLC 0.835677767 0.079191524Amphoteracin B glutamine/lysine LC 0.618892728 0.398094054 AmphoteracinB glutathione pLC 0.951676383 −0.033484535 Amphoteracin B glycanopyroseGC 0.041202857 
−0.957996667 Amphoteracin B glycerol GC 0.089234962−0.815 Amphoteracin B glycine GC 0.431923912 0.880666667 Amphoteracin Bguanine nLC 1 0 Amphoteracin B guanosine nLC 0.425021511 1.131147541Amphoteracin B guanosine LC 0.886514477 0.759776536 Amphoteracin Bhexadecanoic acid GC 0.921125845 −0.242666667 Amphoteracin B histidinenLC 1 0 Amphoteracin B histidine pLC 1 0 Amphoteracin Bhomogentisic/uric nLC 1 0 Amphoteracin B hydrocortisone nLC 1 0Amphoteracin B hydrocortisone pLC 1 0 Amphoteracin B hypoxanthine nLC0.372039959 0.165495208 Amphoteracin B hypoxanthine pLC 0.740826320.205678879 Amphoteracin B indole3pyruvic acid nLC 1 0 Amphoteracin Binostol/glucos/sorb nLC 0.41837757 −0.590534418 Amphoteracin B isocitric acid GC 0.233348939 −0.618333333 Amphoteracin Bisocitric/citric/qu nLC 0.027544382 2.549468869 Amphoteracin Bisoleucine GC 0.021030517 −0.953333333 Amphoteracin B itaconic aciddimes LC 1 0 Amphoteracin B jasmonic acid nLC 1 0 Amphoteracin B kinetinnLC 1 0 Amphoteracin B kinetin pLC 1 0 Amphoteracin B lactic acid nLC0.077524891 −0.025833603 Amphoteracin B lanosterol GC 7.71E-06 −0.99999Amphoteracin B lauric acid nLC 0.972245476 0.122549629 Amphoteracin Bleucine GC 0.015876175 −0.944333333 Amphoteracin B leucine/isoleucine/nLC 0.763305915 0.131916357 Amphoteracin B leucine/isoleucine/ pLC0.723852356 0.274641204 Amphoteracin B luteolin nLC 1 0 Amphoteracin Bluteolin pLC 1 0 Amphoteracin B lysine GC 0.488896519 −0.392535821Amphoteracin B malic acid GC 0.015444108 −0.963005665 Amphoteracin Bmalic acid nLC 0.497517178 0.621171595 Amphoteracin B malonic acid nLC 10 Amphoteracin B mannitol pLC 0.575742486 0.45428497 Amphoteracin Bmenthol* nLC 0.852876357 −0.07013498 Amphoteracin B methionine nLC 1 0Amphoteracin B methionine LC 0.367502423 0.329889113 Amphoteracin Bmevalonic acid GC 0.690626296 −0.127624125 lactone Amphoteracin Bmevalonic lactone pLC 0.251617022 −0.460562414 Amphoteracin BNacetylDglucosamine nLC 1 0 Amphoteracin B NacetylDglucosamine pLC 1 
0Amphoteracin B NacetylLglutamate nLC 0.840704909 −0.107788162Amphoteracin B NacetylLglutamate LC 1 0 Amphoteracin B NacetylLornithinenLC 1 0 Amphoteracin B NacetylLornithine pLC 0.392871315 1.318875781Amphoteracin B niacinamide LC 1 0 Amphoteracin B nicotinic acid nLC0.972130606 0.31 3077939 Amphoteracin B nicotinic acid pLC 7.53474E-05−0.99893617 Amphoteracin B nopaline nLC 0.369522244 0.334229391Amphoteracin B nopaline pLC 1 0 Amphoteracin B octadecanoic acid GC0.660192025 0.21 Amphoteracin B oleic acid GC 0.325422554 −0.459333333Amphoteracin B oleic acid nLC 0.880270386 0.688969565 Amphoteracin Bornithine nLC 0.473753211 2.534415913 Amphoteracin B ornithine pLC0.48461244 0.504866344 Amphoteracin B ornithine2 GC 1.48992E-05 −0.99999Amphoteracin B ornithine3 GC 0.011300115 −0.985326667 Amphoteracin Borotic acid nLC 0.186179266 8380 Amphoteracin B palmiteliadic acid GC0.503020409 0.515 Amphoteracin B palmitic acid nLC 0.902806120.397948025 Amphoteracin B phenylalanine GC 0.010760299 −0.979659887Amphoteracin B phenylalanine nLC 0.76165051 −0.190559006 Amphoteracin Bphenylalanine LC 0.573569375 0.403640768 Amphoteracin B phenylpyruvicacid nLC 1 0 Amphoteracin B phosphate GC 0.983733869 −0.007333333Amphoteracin B phosphoenolpyruvate nLC 1 0 Amphoteracin Bphosphoenolpyruvate pLC 1 0 Amphoteracin B pinitol nLC 1 0 AmphoteracinB pipecolic acid nLC 0.871015411 0.081118937 Amphoteracin B pipecolicacid LC 0.556385814 0.523741811 Amphoteracin B porphobilinogen nLC 1 0Amphoteracin B progesterone pLC 1 0 Amphoteracin B proline nLC0.518220081 0.460347915 Amphoteracin B proline pLC 0.4747621210.670657914 Amphoteracin B pyridoxine nLC 0.708651225 −0.129434556Amphoteracin B pyridoxine pLC 0.776529987 −0.168408149 Amphoteracin Bpyrimidine GC 0.744108261 −0.185 Amphoteracin B retinoic acid nLC 1 0Amphoteracin B riboflavin LC 1 0 Amphoteracin B salicylic/HObenzoic nLC1 0 Amphoteracin B selenoOLmethionine nLC 0.711447529 0.851513124Amphoteracin B selenoDLmethionine pLC 0.888275646 
1.177511152Amphoteracin B serine nLC 0.766811518 0.10907441 Amphoteracin B serineLC 0.716422123 0.201058201 Amphoteracin B shikimic acid nLC 1 0Amphoteracin B sinapinic acid nLC 1 0 Amphoteracin B sorbitol/mannitolnLC 0.68492695 0.216175129 Amphoteracin B squalene GC 0.254158772−0.574475175 Amphoteracin B succinic nLC 0.193450596 0.866316251Amphoteracin B sucrose nLC 0.225682636 0.449275362 Amphoteracin B sugar?GC 0.019518223 −0.932993333 Amphoteracin B sugar-phosphate nLC0.878141701 −0.106666667 Amphoteracin B sugar-phosphate pLC 1 0Amphoteracin B tetradecanoic acid GC 0.793963653 0.079666667Amphoteracin B tetradecanoic acid nLC 0.782765706 −0.077232772Amphoteracin B thiamine pLC 1 0 Amphoteracin B threonine/homoserin nLC0.769444989 0.126655553 Amphoteracin B threonine/homoserin pLC0.668114613 0.314511535 Amphoteracin B threonine2 GC 0.073159868−0.855333333 Amphoteracin B threonine3 GC 0.063199416 −0.893333333Amphoteracin B thymine nLC 1 0 Amphoteracin B thymine pLC 1 0Amphoteracin B tms glutamine3 GC 0.003279434 −0.893478913 Amphoteracin Btms lysine4 GC 0.032217789 −0.97833 Amphoteracin B TMS mevalonic acid GC0.012983194 −0.976652217 lactone Amphoteracin B tms tyrosine2 GC0.601581614 −0.359333333 Amphoteracin B tms tyrosine3 GC 0.029953667−0.947315772 Amphoteracin B tryptophan nLC 0.380816515 1.141975309Amphoteracin B tryptophan pLC 1 0 Amphoteracin B tyrosine nLC0.807539229 0.098201061 Amphoteracin B tyrosine LC 0.7351745420.234676626 Amphoteracin B uracil nLC 0.359441135 1.510500389Amphoteracin B uric acid pLC 0.069269066 308 Amphoteracin B uridine nLC0.293422211 0.112573965 Amphoteracin B urocanic acid nLC 1 0Amphoteracin B urocanic acid pLC 1 0 Amphoteracin B valine GC0.026729753 −0.867333333 Amphoteracin B valine nLC 0.7325167590.162425739 Amphoteracin B xanthosine(diH2O) pLC 1 0 Amphoteracin BxanthosineDiH2O nLC 1 0 Amphoteracin B zeatin nLC 1 0 Amphoteracin Bzeatin pLC 1 0 Fluconazole 2-ketobutyric acid nLC 0.225631474−0.999150382 Fluconazole 
2-ketoglutaric nLC 0.050037991 −0.999457799Fluconazole 3-indolylacetonitri nLC 0.197015297 −0.999782451 Fluconazole4ambutyrate/dimglyc LC 0.55610932 −0.438329556 Fluconazole4-aminobenzoic acid pLC 1 0 Fluconazole 4aminobutyrate/dimg nLC0.796062459 0.13842334 Fluconazole 4-methylcatechol nLC 1 0 Fluconazole4-methylcatechol pLC 1 0 Fluconazole 5hydroxyLtryptophan nLC 1 0Fluconazole 5hydroxyLtryptophan pLC 1 0 Fluconazole 6benzylaminopurinepLC 1 0 Fluconazole 6-benzylaminopurine nLC 1 0 Fluconazole abscisicacid nLC 1 0 Fluconazole abscisic acid LC 1 0 Fluconazole aconitic acidnLC 0.785890509 0.648259692 Fluconazole adenine nLC 1 0 Fluconazoleadenine pLC 0.842498314 −0.094389696 Fluconazole adenosine nLC 1 0Fluconazole adenosine LC 1 0 Fluconazole alanine GC 0.6720169490.308436145 Fluconazole alanine nLC 0.514232967 0.3989834 Fluconazolealanine/sarcosine LC 0.569965606 0.126948182 Fluconazole allantoic acidnLC 0.693763056 0.239201283 Fluconazole allantoic acid pLC 1 0Fluconazole allantoin nLC 0.201180044 0.394248589 Fluconazoleanthranilic acid nLC 1 0 Fluconazole anthranilic acid LC 1 0 Fluconazolearginine nLC 0.172474156 0.648362584 Fluconazole arginine pLC0.591952135 0.1179275 Fluconazole argininosuccinate nLC 1 0 Fluconazoleargininosuccinate pLC 1 0 Fluconazole asparagine GC 0.5992216410.399866711 Fluconazole asparagine nLC 0.589600334 0.354464539Fluconazole asparagine pLC 0.605531557 0.319224556 Fluconazole asparticnLC 0.515133125 0.499266169 Fluconazole aspartic acid GC 0.6213935790.433666667 Fluconazole aspartic acid pLC 0.67688527 0.214006141Fluconazole benzoic acid nLC 1 0 Fluconazole biotin nLC 0.4059534330.345482947 Fluconazole biotin pLC 1 0 Fluconazole caffeic acid nLC0.584388595 −0.471092077 Fluconazole caffeine pLC 1 0 Fluconazolecampesterol GC 1 0 Fluconazole catechol nLC 1 0 Fluconazole cinnamicacid nLC 1 0 Fluconazole citric acid TME pLC 1 0 Fluconazolecitricanoic/itaconi nLC 1 0 Fluconazole citrulline nLC 1 0 Fluconazolecitrulline pLC 0.821686047 
0.082522841 Fluconazole coumaric acid nLC 1 0Fluconazole cytidine nLC 0.067383137 −0.796925943 Fluconazole cytidinepLC 1 0 Fluconazole cytosine nLC 1 0 Fluconazole cytosine pLC 1 0Fluconazole decanoic acid nLC 0.523474499 0.184634286 Fluconazoledesmosterol GC 1 0 Fluconazole diaminopimelic acid nLC 1 0 Fluconazolediaminopimelic acid pLC 1 0 Fluconazole dihydrofolic acid nLC 1 0Fluconazole dihydrofolic acid pLC 1 0 Fluconazole dipicolinic acid pLC 10 Fluconazole disaccaride1 GC 0.581808965 0.388333333 Fluconazoledisaccaride2 GC 0.805350356 0.151666667 Fluconazole disaccaride3 GC0.500838115 0.580333333 Fluconazole DLaminoadipic acid nLC 0.9611481790.443010753 Fluconazole DL-aminoadipic acid LC 0.258675092 0.115480962Fluconazole ergosterol GC 0.411376724 0.948 Fluconazole estrone nLC 1 0Fluconazole farnesol nLC 1 0 Fluconazole folic acid nLC 1 0 Fluconazolefolic acid pLC 1 0 Fluconazole fucosterol GC 0.015716048 6.665Fluconazole fumaric/3m2oxobutan nLC 0.212701071 0.600893928 Fluconazolegallic acid nLC 0.235229644 0.507086324 Fluconazole gibberellic nLC 1 0Fluconazole glucosamine pLC 1 0 Fluconazole glucosamine6PO4 nLC0.273438701 −0.995114007 Fluconazole glucosamine6PO4 pLC 1 0 Fluconazoleglutamate pLC 0.883828911 −0.061793299 Fluconazole glutamic/acetylserinLC 0.56055075 0.384161186 Fluconazole glutamine GC 0.4858439910.524174725 Fluconazole glutamine/lysinen LC 0.609631316 0.330898992Fluconazole glutamine/lysinep LC 0.670624203 0.224216219 Fluconazoleglutathione pLC 0.92752344 −0.058315351 Fluconazole glycanopyrose GC0.347157825 1.202333333 Fluconazole glycerol GC 0.668832185 0.218666667Fluconazole glycine GC 0.802369966 −0.103666667 Fluconazole guanine nLC1 0 Fluconazole guanosine nLC 0.285463594 −0.992974239 Fluconazoleguanosine LC 0.060854626 −0.998137803 Fluconazole hexadecanoic acid GC0.652442377 0.134333333 Fluconazole histidine nLC 1 0 Fluconazolehistidine pLC 1 0 Fluconazole homogentisic/uric nLC 1 0 Fluconazolehydrocortisone nLC 1 0 Fluconazole 
hydrocortisone LC 1 0 Fiuconazolehypoxanthine nLC 0.259732062 0.77571885 Fluconazole hypoxanthine pLC0.736842203 0.129759971 Fluconazole indole3pyruvic acid nLC 1 0Fluconazole inostol/glucos/sorb nLC 0.57332042 −0.47442546 Fluconazoleiso citric acid GC 0.588523447 0.392333333 Fluconazoleisocitric/citric/qu nLC 0.288980226 1.457227098 Fluconazole isoleucineGC 0.634637433 0.391 Fluconazole itaconic acid dimes pLC 1 0 Fluconazolejasmonic acid nLC 1 0 Fluconazole kinetin nLC 1 0 Fluconazole kinetinpLC 1 0 Fluconazole lactic acid nLC 0.90233218 −0.043897702 Fluconazolelanosterol GC 0.021305043 8.462333333 Fluconazole lauric acid nLC0.405736617 0.390567367 Fluconazole leucine GC 0.655160145 0.338666667Fluconazole leucine/isoleucine/ nLC 0.610189969 0.330601522 Fluconazoleleucine/isoleucine/ pLC 0.684383809 0.163833602 Fluconazole luteolin nLC1 0 Fluconazole luteolin pLC 1 0 Fluconazole lysine GC 0.596764160.341219594 Fluconazole malic acid GC 0.629662238 0.397534155Fluconazole malic acid nLC 0.575009587 0.43661293 Fluconazole malonicacid nLC 1 0 Fluconazole mannitol pLC 0.743695348 0.151992706Fluconazole menthol* nLC 0.860810154 0.047582203 Fluconazole methioninenLC 1 0 Fluconazole methionine pLC 0.279599722 −0.290574597 Fluconazolemevalonic acid GC 0.704278777 0.233255582 lactone Fluconazole mevaloniclactone pLC 0.241778766 −0.517489712 Fluconazole NacetylDglucosamine nLC1 0 Fluconazole NacetylDglucosamine pLC 1 0 FluconazoleNacetylLglutamate nLC 0.839927069 0.136915888 FluconazoleNacetylLglutamate pLC 1 0 Fluconazole NacetylLornithine nLC 1 0Fluconazole NacetylLornithine pLC 0.718034342 0.158396947 Fluconazoleniacinamide pLC 1 0 Fluconazole nicotinic acid nLC 0.0615365852.442536328 Fluconazole nicotinic acid pLC 0.052262619 −0.79822695Fluconazole nopaline nLC 0.350953395 0.343189964 Fluconazole nopalinepLC 1 0 Fluconazole octadecanoic acid CC 0.889163721 0.082 Fluconazoleoleic acid GC 0.364873247 −0.307333333 Fluconazole oleic acid nLC0.966227899 −0.016835748 
Fluconazole ornithine nLC 0.6033763920.350845648 Fluconazole ornithine pLC 0.464434284 0.477540988Fluconazole ornithine2 CC 0.607787058 0.447333333 Fluconazole ornithine3GC 0.69362274 0.264666667 Fluconazole orotic acid nLC 1 0 Fluconazolepalmiteliadic acid CC 0.813004804 −0.088666667 Fluconazole palmitic acidnLC 0.914973348 −0.023301814 Fluconazole phenylalanine CC 0.7117879490.277425809 Fluconazole phenylalanine nLC 0.763353558 0.152670808Fluconazole phenylalanine pLC 0.843620891 −0.180920325 Fluconazolephenylpyruvic acid nLC 1 0 Fluconazole phosphate CC 0.147008309−0.507996667 Fluconazole phosphoenolpyruvate nLC 1 0 Fluconazolephosphoenolpyruvate pLC 1 0 Fluconazole pinitol nLC 1 0 Fluconazolepipecolic acid nLC 0.651219102 0.290059228 Fluconazole pipecolic acidpLC 0.666832577 0.25832748 Fluconazole porphobilinogen nLC 1 0Fluconazole progesterone pLC 1 0 Fluconazole proline nLC 0.5465945430.416410847 Fluconazole proline pLC 0.606102286 0.207549593 Fluconazolepyridoxine nLC 0.922916545 −0.042014772 Fluconazole pyridoxine pLC0.441455035 −0.383106649 Fluconazole pyrimidine CC 0.736324370.245666667 Fluconazole retinoic acid nLC 1 0 Fluconazole riboflavin pLC1 0 Fluconazole salicylic/HObenzoic nLC 1 0 FluconazoleselenoDLmethionine nLC 0.319500806 −0.574763923 FluconazoleselenoDLmethionine pLC 0.232642988 −0.686509768 Fluconazole serine nLC0.582348829 0.393647913 Fluconazole serine LC 0.76807688 0.109960893Fluconazole shikimic acid nLC 1 0 Fluconazole sinapinic acid nLC 1 0Fluconazole sorbitol/mannitol nLC 0.591808093 0.364953887 Fluconazolesqualene GC 0.602775269 0.199933356 Fluconazole succinic nLC 0.25222130.309417433 Fluconazole sucrose nLC 0.241405138 0.310410154 Fiuconazolesugar? 
GC 0.580258174 0.481666667 Fluconazole sugar-phosphate nLC0.956717825 −0.053057471 Fluconazole sugar-phosphate pLC 1 0 Fluconazoletetradecanoic acid GC 0.856705431 0.079666667 Fluconazole tetradecanoicacid nLC 0.46350082 0.54945313 Fluconazole thiamine pLC 1 0 Fluconazolethreonine/homoserin nLC 0.608964827 0.325738631 Fluconazolethreonine/homoserin pLC 0.718235353 0.160353176 Fluconazole threonine2GC 0.525080919 0.505333333 Fluconazole threonine3 GC 0.753497460.251333333 Fluconazole thymine nLC 1 0 Fluconazole thymine pLC 1 0Fluconazole tms glutamine3 GC 0.254931664 0.727121187 Fluconazole tmslysine4 GC 0.627281408 0.365666667 Fluconazole TMS mevalonic acid GC0.822784777 0.143381127 lactone Fluconazole tms tyrosine2 GC 0.4075030960.864333333 Fluconazole tms tyrosine3 GC 0.646523562 0.332110704Fluconazole tryptophan nLC 0.360511436 1.648709315 Fluconazoletryptophan pLC 1 0 Fluconazole tyrosine nLC 0.701987245 0.230338937Fluconazole tyrosine pLC 0.761710986 0.109881652 Fluconazole uracil nLC0.357108991 1.256157636 Fluconazole uric acid pLC 1 0 Fluconazoleuridine nLC 0.242998296 0.346301775 Fluconazole urocanic acid nLC 1 0Fluconazole urocanic acid pLC 1 0 Fluconazole valine GC 0.7118432120.272666667 Fluconazole valine nLC 0.6138852 0.323524419 Fluconazolexanthosine(diH2O) pLC 1 0 Fluconazole xanthosineDiH2O nLC 1 0Fluconazole zeatin nLC 1 0 Fluconazole zeatin pLC 1 0 Ketoconazole2-ketobutyric acid nLC 0.9639671 0.480600397 Ketoconazole 2-ketoglutaricnLC 0.050037991 −0.999457799 Ketoconazole 3-indolylacetonitri nLC0.95501953 0.699782451 Ketoconazole 4ambutyrate/dimglyc pLC 0.5241370710.584817093 Ketoconazole 4-aminobenzoic acid pLC 1 0 Ketoconazole4aminobutyrate/dimg nLC 0.461393936 −0.485785227 Ketoconazole4-methylcatechol nLC 1 0 Ketoconazole 4-methylcatechol pLC 1 0Ketoconazole 5hydroxyLtryptophan nLC 1 0 Ketoconazole5hydroxyLtryptophan pLC 1 0 Ketoconazole 6benzylaminopurine pLC 1 0Ketoconazole 6-benzylaminopurine nLC 1 0 Ketoconazole abscisic acid nLC1 0 
Ketoconazole abscisic acid pLC 1 0 Ketoconazoie aconitic acid nLC0.67459115 0.635606581 Ketoconazole adenine nLC 1 0 Ketoconazole adeninepLC 0.996845972 0.018126006 Ketoconazole adenosine nLC 1 0 Ketoconazoleadenosine pLC 0.061512704 549.3333333 Ketoconazole alanine GC0.742203249 0.23141047 Ketoconazole alanine nLC 0.560597277 −0.528917036Ketoconazole alanine/sarcosine pLC 0.571450791 −0.504395897 Ketoconazoleallantoic acid nLC 0.151749563 −0.686669081 Ketoconazole allantoic acidpLC 1 0 Ketoconazole allantoin nLC 0.888672729 −0.340295275 Ketoconazoleanthranilic acid nLC 1 0 Ketoconazole anthranilic acid pLC 1 0Ketoconazole arginine nLC 0.031257842 −0.999961229 Ketoconazole argininepLC 0.028481658 −0.996209523 Ketoconazole argininosuccinate nLC 1 0Ketoconazole argininosuccinate pLC 1 0 Ketoconazole asparagine GC0.381635461 0.583138954 Ketoconazole asparagine nLC 0.594223659−0.405454029 Ketoconazole asparagine pLC 0.807721418 −0.106515886Ketoconazole aspartic nLC 0.620101115 −0.403930348 Ketoconazole asparticacid GC 0.664887605 0.299666667 Ketoconazole aspartic acid pLC0.794913561 −0.208404622 Ketoconazole benzoic acid nLC 0.2170272361719.333333 Ketoconazole biotin nLC 0.981428203 −0.180942463Ketoconazole biotin LC 1 0 Ketoconazole caffeic acid nLC 0.22121578−0.429418547 Ketoconazole caffeine pLC 1 0 Ketoconazole campesterol GC 10 Ketoconazole catechol nLC 1 0 Ketoconazole cinnamic acid nLC 1 0Ketoconazole citric acid TME pLC 1 0 Ketoconazole citricanoic/itaconinLC 0.219726535 1522.333333 Ketoconazole citrulline nLC 1 0 Ketoconazolecitrulline pLC 0.889543516 0.651144513 Ketoconazole coumaric acid nLC 10 Ketoconazole cytidine nLC 0.102328077 −0.36143456 Ketoconazolecytidine pLC 0.068336435 393.6666667 Ketoconazole cytosine nLC 1 0Ketoconazole cytosine pLC 1 0 Ketoconazole decanoic acid nLC 0.18889589−0.387872406 Ketoconazole desmosterol GC 1 0 Ketoconazole diaminopimelicacid nLC 0.208740638 2537 Ketoconazole diaminopimelic acid PLC 1 0Ketoconazole dihydrofolic acid 
nLC 1 0 ketoconazole dihydrofolic acidpLC 1 0 Ketoconazole dipicolinic acid pLC 1 0 Ketoconazole disaccaride1GC 0.247275227 1.231666667 Ketoconazole disaccaride2 GC 0.57432915 0.405Ketoconazole disaccaride3 GC 0.273927592 1.143666667 KetoconazoleDLaminoadipic acid nLC 0.282562804 −0.993548387 KetoconazoleDL-aminoadipic acid pLC 0.041949247 −0.999749499 Ketoconazole ergosterolGC 0.457850979 0.792666667 Ketoconazole estrone nLC 1 0 Ketoconazolefarnesol nLC 1 0 Ketoconazole folic acid nLC 1 0 Ketoconazole folic acidLC 1 0 Ketoconazole fucosterol GC 0.007283106 7.146333333 Ketoconazolefumaric/3m2oxobutan nLC 0.879463953 −0.442058214 Ketoconazole gallicacid nLC 0.341926441 −0.699797534 Ketoconazole gibberellic nLC 1 0Ketoconazole glucosamine LC 1 0 Ketoconazole glucosamine6PO4 nLC0.993629524 −0.058631922 Ketoconazole glucosamine6PO4 pLC 0.07458733260.6666667 Ketoconazole glutamate LC 0.713626372 −0.245514762Ketoconazole glutamic/acetylseri nLC 0.537323804 −0.522523365Ketoconazole glutamine GC 0.335902006 0.709569857 Ketoconazolegiutamine/lysine nLC 0.673999294 −0.320397038 Ketoconazoleglutamine/lysine LC 0.788208454 −0.193172287 Ketoconazole glutathione LC0.911022134 −0.057412545 Ketoconazole glycanopyrose GC 0.225636823 1.592Ketoconazole glycerol GC 0.126154516 0.915 Ketoconazole glycine GC0.896523858 −0.059666667 Ketoconazole guanine nLC 1 0 Ketoconazoleguanosine nLC 0.285463594 −0.992974239 Ketoconazole guanosine LC0.232818183 0.862197393 Ketoconazole hexadecanoic acid GC 0.5549362070.373 Ketoconazole histidine nLC 1 0 Ketoconazole histidine LC 1 0Ketoconazole homogentisic/uric nLC 1 0 Ketoconazole hydrocortisone nLC 10 Ketoconazole hydrocortisone pLC 1 0 Ketoconazole hypoxanthine nLC0.417257665 −0.476677316 Ketoconazole hypoxanthine pLC 0.721865016−0.280334476 Ketoconazole indole3pyruvic acid nLC 1 0 Ketoconazoleinostol/glucos/sorb nLC 0.887855007 0.315683171 Ketoconazole iso citricacid GC 0.26048524 0.964333333 Ketoconazole isocitric/citric/qu nLC0.977182788 
−0.248481007 Ketoconazole isoleucine GC 0.5939301990.453666667 Ketoconazole itaconic acid dimes pLC 1 0 Ketoconazolejasmonic acid nLC 1 0 Ketoconazole kinetin nLC 1 0 Ketoconazole kinetinpLC 1 0 Ketoconazole lactic acid nLC 0.355289051 0.475040466Ketoconazole lanosterol GC 0.013296827 8.435666667 Ketoconazole lauricacid nLC 0.921247829 −0.042003398 Ketoconazole leucine GC 0.5107220730.628333333 Ketoconazole leucine/isoleucine/ nLC 0.69030324 −0.339497239Ketoconazole leucine/isoleucine/ pLC 0.694390781 −0.279509064Ketoconazole luteolin nLC 1 0 Ketoconazole luteolin LC 1 0 Ketoconazolelysine GC 0.312893118 0.702765745 Ketoconazole malic acid GC 0.2304401960.374878374 Ketoconazole malic acid nLC 0.741534381 −0.202679583Ketoconazole malonic acid nLC 0.229622684 993.6666667 Ketoconazolemannitol LC 0.075185984 −0.482808023 Ketoconazole menthol* nLC0.894522787 0.0558346 Ketoconazole methionine nLC 1 0 Ketoconazolemethionine LC 0.000132547 −0.999243952 Ketoconazole mevalonic acid GC0.299567095 0.345884705 lactone Ketoconazole mevalonic lactone pLC0.000458094 −0.999742798 Ketoconazole NacetylDglucosamine nLC 1 0Ketoconazole NacetylDglucosamine pLC 1 0 Ketoconazole NacetylLglutamatenLC 0.766840163 −0.16152648 Ketoconazole NacetylLglutamate LC0.000379693 1232.333333 Ketoconazole NacetylLornithine nLC 1 0Ketoconazole NacetylLornithine pLC 0.904993806 −0.081367106 Ketoconazoleniacinamide LC 1 0 Ketoconazole nicotinic acid nLC 0.995525033−0.042272127 Ketoconazole nicotinic acid pLC 7.53474E-05 −0.99893617Ketoconazole nopaline nLC 0.065964767 −0.997311828 Ketoconazole nopalineLC 1 0 Ketoconazole octadecanoic acid GC 0.241136181 0.512333333Ketoconazole oleic acid GC 0.457404638 −0.388333333 Ketoconazole oleicacid nLC 0.15269473 −0.526000068 Ketoconazole ornithine nLC 0.240401148−0.414620442 Ketoconazole ornithine pLC 0.216567154 −0.871917457Ketoconazole ornithine2 GC 0.08811091 −0.782333333 Ketoconazoleornithine3 GC 0.511712533 0.486333333 Ketoconazole orotic acid nLC0.218957363 
1575.666667 Ketoconazole palmiteliadic acid GC 0.690588295−0.187 Ketoconazole palmitic acid nLC 0.55407711 −0.44537984Ketoconazole phenylalanine GC 0.570127457 0.364454818 Ketoconazolephenylalanine nLC 0.190982317 −0.591801 242 Ketoconazole phenylalaninepLC 0.202078489 −0.36668569 Ketoconazole phenylpyruvic acid nLC 1 0Ketoconazole phosphate GC 0.602201543 −0.268333333 Ketoconazolephosphoenolpyruvate nLC 1 0 Ketoconazole phosphoenolpyruvate pLC 1 0Ketoconazole pinitol . nLC 0.244808608 545 Ketoconazole pipecolic acidnLC 0.223402828 −0.41143749 Ketoconazole pipecolic acid pLC 0.804903885−0.15667062 Ketoconazole porphobilinogen nLC 1 0 Ketoconazoleprogesterone pLC 1 0 Ketoconazole proline nLC 0.637191689 −0.353351241Ketoconazole proline pLC 0.793251122 −0.161195947 Ketoconazolepyridoxine nLC 0.839423897 −0.092505146 Ketoconazole pyridoxine pLC0.894790663 −0.075068589 Ketoconazole pyrimidine GC 0.2968539110.740333333 Ketoconazole retinoic acid nLC 1 0 Ketoconazole riboflavinpLC 1 0 Ketoconazole salicylic/HObenzoic nLC 1 0 KetoconazoleselenoDLmethionine nLC 0.617802219 0.965499294 KetoconazoleselenoDLmethionine pLC 0.501432519 1.149746193 Ketoconazofe serine nLC0.602918586 −0.469419238 Ketoconazole serine pLC 0.705817734−0.313779618 Ketoconazole shikimic acid nLC 0.159073415 49446.33333Ketoconazole sinapinic acid nLC 1 0 Ketoconazole sorbitol/mannitol nLC0.326913111 0.469342252 Ketoconazole squalene GC 0.646962325 0.437187604Ketoconazole succinic nLC 0.934705564 −0.266228647 Ketoconazole sucrosenLC 0.356348305 −0.516908213 Ketoconazole sugar? 
GC 0.5180283980.534666667 Ketoconazole sugar-phosphate nLC 0.607811705 −0.290298851Ketoconazole sugar-phosphate pLC 0.065129247 385 Ketoconazoletetradecanoic acid GC 0.542673889 0.259333333 Ketoconazole tetradecanoicacid nLC 0.826830708 −0.141716433 Ketoconazole thiamine pLC 1 0Ketoconazole threonine/homoserin nLC 0.720684532 −0.320459387Ketoconazole threonine/homoserin pLC 0.729834457 −0.252954999Ketoconazole threonine2 GC 0.369980722 0.630333333 Ketoconazolethreonine3 GC 0.771315792 0.184666667 Ketoconazole thymine nLC 1 0Ketoconazole thymine pLC 1 0 Ketoconazole tms glutamine3 GC 0.2252438150.826971162 Ketoconazole tms lysine4 GC 0.548698451 0.452 KetoconazoleTMS mevalonic acid GC 0.083516634 −0.859283094 lactone Ketoconazole tmstyrosine2 GC 0.215698651 1.561666667 Ketoconazole tms tyrosine3 GC0.505545511 0.437812604 Ketoconazole tryptophan nLC 0.9971018960.03030303 Ketoconazole tryptophan LC 1 0 Ketoconazole tyrosine nLC0.682093146 −0.323276916 Ketoconazole tyrosine pLC 0.774022599−0.222804007 Ketoconazole uracil nLC 0.223581594 −0.999222193Ketoconazole uric acid pLC 1 0 Ketoconazole uridine nLC 0.327767929−0.740828402 Ketoconazole urocanic acid nLC 0.253172611 401.3333333Ketoconazole urocanic acid pLC 1 0 Ketoconazole valine GC 0.6343370150.357 Ketoconazole valine nLC 0.630670374 −0.382933416 Ketoconazolexanthosine(diH2O) LC 1 0 Ketoconazole xanthosineDiH2O nLC 1 0Ketoconazole zeatin nLC 1 0 Ketoconazole zeatin pLC 1 0 Posaconazole2-ketobutyric acid nLC 0.225631474 −0.999150382 Posaconazole2-ketoglutaric nLC 0.578339703 11.32767034 Posaconazole3-indolylacetonitri nLC 0.197015297 −0.999782451 Posaconazole4ambutyrate/dimglyc pLC 0.963777302 31.37374555 Posaconazole4-aminobenzoic acid pLC 1 0 Posaconazole 4aminobutyrate/dimg nLC0.934446326 −0.008938547 Posaconazole 4-methylcatechol nLC 1 0Posaconazole 4-methylcatechol nLC 1 0 Posaconazole 5hydroxyLtryptophannLC 1 0 Posaconazole 5hydroxyLtryptophan LC 1 0 Posaconazole6benzylaminopurine pLC 1 0 Posaconazole 
6-benzylaminopurine nLC 1 0Posaconazole abscisic acid nLC 1 0 Posaconazole abscisic acid LC 1 0Posaconazole aconitic acid nLC 0.14418007 1.874075272 Posaconazoleadenine nLC 1 0 Posaconazole adenine LC 0.97905854 −0.014499036Posaconazole adenosine nLC 0.288782643 128.3333333 Posaconazoleadenosine pLC 1 0 Posaconazole alanine GC 0.443509194 1.035345115Posaconazole alanine nLC 0.665454482 0.109114473 Posaconazolealanine/sarcosine pLC 0.572428945 0.101305448 Posaconazole allantoicacid nLC 0.591698395 0.332057317 Posaconazole allantoic acid pLC 1 0Posaconazole allantoin nLC 0.125054459 5.970180955 Posaconazoleanthranilic acid nLC 1 0 Posaconazole anthranilic acid pLC 1 0Posaconazole arginine nLC 0.209275262 0.17072036 Posaconazole argininepLC 0.655241349 0.033825172 Posaconazole argininosuccinate nLC0.259815185 318.3333333 Posaconazole argininosuccinate pLC 1 0Posaconazole asparagine GC 0.915263335 0.144951683 Posaconazoleasparagine nLC 0.898337684 0.00686488 Posaconazole asparagine pLC0.571388297 0.430210016 Posaconazole aspartic nLC 0.6398309630.304975124 Posaconazole aspartic acid GC 0.860113055 −0.071Posaconazole aspartic acid LC 0.605000404 0.362551855 Posaconazolebenzoic acid nLC 1 0 Posaconazole biotin nLC 0.22367742 −0.999218953Posaconazole biotin LC 1 0 Posaconazole caffeic acid nLC 0.580263509−0.490034591 Posaconazole caffeine pLC 1 0 Posaconazole campesterol GC 10 Posaconazole catechol nLC 1 0 Posaconazole cinnamic acid nLC 1 0Posaconazole citric acid TME pLC 1 0 Posaconazole citricanoic/itaconinLC 1 0 Posaconazole citrulline nLC 0.225951875 1160 Posaconazolecitrulline pLC 0.777333431 0.10109048 Posaconazole coumaric acid nLC 1 0Posaconazole cytidine nLC 0.000680198 −0.998602701 Posaconazole cytidinepLC 1 0 Posaconazole cytosine nLC 1 0 Posaconazole cytosine pLC0.226462948 1135 Posaconazole decanoic acid nLC 0.787160954 0.126277917Posaconazole desmosterol GC 1 0 Posaconazole diaminopimelic acid nLC 1 0Posaconazole diaminopimelic acid pLC 1 0 Posaconazole 
dihydrofolic acidnLC 1 0 Posaconazole dihydrofolic acid pLC 1 0 Posaconazole dipicolinicacid LC 1 0 Posaconazole disaccaride1 GC 0.958378084 0.047333333Posaconazole disaccaride2 GC 0.718230465 0.313 Posaconazole disaccaride3GC 0.830961848 0.340666667 Posaconazole DLaminoadipic acid nLC0.282562804 −0.993548387 Posaconazole DL-aminoadipic acid pLC 0.179737361.684786239 Posaconazole ergosterol GC 0.485731041 0.808 Posaconazoleestrone nLC 1 0 Posaconazole farnesol nLC 1 0 Posaconazole folic acidnLC 1 0 Posaconazole folic acid PLC 1 0 Posaconazole fucosterol GC0.006770761 6.722333333 Posaconazole fumaric/3m2oxobutan nLC 0.043783124−0.999672953 Posaconazole gallic acid nLC 0.246548839 0.441376772Posaconazole gibberellic nLC 1 0 Posaconazole glucosamine pLC 1 0Posaconazole glucosamine6PO4 nLC 0.273438701 −0.995114007 Posaconazoleglucosamine6PO4 pLC 1 0 Posaconazole glutamate pLC 0.5532710670.376935809 Posaconazole glutamic/acetylseri nLC 0.593398809 0.322249352Posaconazole glutamine GC 0.460619522 0.643881294 Posaconazoleglutamine/lysine nLC 0.821564835 0.098138243 Posaconazoleglutamine/lysine pLC 0.621404602 0.303010036 Posaconazole glutathionepLC 0.960269566 0.010099676 Posaconazole glycanopyrose GC 0.2887165051.593666667 Posaconazole glycerol GC 0.615962586 0.187333333Posaconazole glycine GC 0.96573947 0.082666667 Posaconazole guanine nLC1 0 Posaconazole guanosine nLC 0.285463594 −0.992974239 Posaconazoleguanosine pLC 0.988504987 0.046554935 Posaconazole hexadecanoic acid GC0.773386672 −0.018666667 Posaconazole histidine nLC 1 0 Posaconazolehistidine pLC 1 0 Posaconazole homogentisic/uric nLC 1 0 Posaconazolehydrocortisone nLC 1 0 Posaconazole hydrocortisone pLC 1 0 Posaconazolehypoxanthine nLC 0.966438425 0.175079872 Posaconazole hypoxanthine pLC0.724112426 0.134993684 Posaconazole indole3pyruvic acid nLC 1 0Posaconazole nostol/glucos/sorb nLC 0.570836266 −0.492985425Posaconazole iso citric acid GC 0.584793588 0.4 Posaconazoleisocitric/citric/qu nLC 0.282679268 
1.710695637 Posaconazole isoleucineGC 0.815398307 0.102333333 Posaconazole laconic acid dimes pLC 1 0Posaconazole jasmonic acid nLC 1 0 Posaconazole kinetin nLC 1 0Posaconazole kinetin pLC 1 0 Posaconazole lactic acid nLC 0.6717352460.598705083 Posaconazole lanosterol GC 0.025813439 7.463666667Posaconazole lauric acid nLC 0.184704286 0.983298106 Posaconazoleleucine GC 0.605917046 0.334 Posaconazole leucine/isoleucine/ nLC0.852184303 0.031580645 Posaconazole leucine/isoleucine/ pLC 0.7720874660.049372553 Posaconazole luteolin nLC 1 0 Posaconazoie luteolin LC 1 0Posaconazole lysine GC 0.738361003 0.158613795 Posaconazole malic acidGC 0.620850674 0.294235255 Posaconazole malic acid nLC 0.6716505380.310055664 Posaconazole malonic acid nLC 1 0 Posaconazole mannitol pLC0.562416384 0.391898932 Posaconazole menthol* nLC 0.804746729 0.08120719Posaconazole methionine nLC 1 0 Posaconazole methionine pLC 0.250499977−0.399697581 Posaconazole mevalonic acid GC 0.299851368 0.555481506lactone Posaconazole mevalonic lactone pLC 0.315594728 −0.083676269Posaconazole NacetylDglucosamine nLC 1 0 PosaconazoleNacetylLglucosamine pLC 1 0 Posaconazole NacetylLglutamate nLC0.892160969 −0.07165109 Posaconazole NacetylLglutamate pLC 1 0Posaconazole NacetylLornithine nLC 1 0 Posaconazole NacetylLornithinepLC 0.729225825 0.133761277 Posaconazole niacinamide pLC 1 0Posaconazole nicotinic acid nLC 0.405290885 1.421400264 Posaconazolenicotinic acid pLC 0.050517148 −0.814184397 Posaconazole nopaline nLC0.251582538 0.755376344 Posaconazole nopaline pLC 1 0 Posaconazoleoctadecanoic acid GC 0.456065185 0.321 Posaconazole oleic acid GC0.32058481 −0.505666667 Posaconazole oleic acid nLC 0.4220012560.5606148 Posaconazole ornithine nLC 0.434565877 2.313985503Posaconazole ornithine pLC 0.497418018 0.4313985 Posaconazole ornithine2GC 0.692849376 −0.138333333 Posaconazole ornithine3 GC 0.965157424 0.041Posaconazole orotic acid nLC 1 0 Posaconazole palmiteliadic acid GC0.59834357 −0.245333333 Posaconazole 
palmitic acid nLC 0.7548173550.151759295 Posaconazole phenylalanine GC 0.931065787 0.060686896Posaconazole phenylalanine nLC 0.879237271 −0.037018634 Posaconazolephenylalanine pLC 0.851071927 −0.15852366 Posaconazole phenylpyruvicacid nLC 1 0 Posaconazole phosphate GC 0.697409194 −0.183333333Posaconazole phosphoenolpyruvate nLC 1 0 Posaconazolephosphoenolpyruvate LC 1 0 Posaconazole pinitol nLC 1 0 Posaconazolepipecolic acid nLC 0.742425122 0.180566672 Posaconazole pipecolic acidpLC 0.631202801 0.344440479 Posaconazote porphobilinogen nLC 1 0Posaconazole progesterone pLC 1 0 Posaconazole proline nLC 0.6106942740.31465848 Posaconazole proline LC 0.536721684 0.315655773 Posaconazolepyridoxine nLC 0.826237958 −0.101465068 Posaconazole pyridoxine pLC0.680775731 −0.238981963 Posaconazole pyrimidine GC 0.727269762 0.159Posaconazole retinoic acid nLC 1 0 Posaconazole riboflavin pLC 1 0Posaconazole salicylic/HObenzoic nLC 1 0 Posaconazole selenoDLmethioninenLC 0.205679572 −0.672168934 Posaconazole selenoDLmethionine pLC0.20869348 −0.70827565 Posaconazole serine nLC 0.731434141 0.15199637Posaconazole serine pLC 0.687746939 0.241660916 Posaconazole shikimicacid nLC 1 0 Posaconazole sinapinic acid nLC 1 0 Posaconazolesorbitol/mannitol nLC 0.98832955 0.009121314 Posaconazole squalene GC0.914326664 0.074308564 Posaconazole succinic nLC 0.310191321−0.061410425 Posaconazole sucrose nLC 0.251150065 0.184711566Posaconazole sugar? 
GC 0.554618941 0.692333333 Posaconazolesugar-phosphate nLC 0.96801556 −0.045425287 Posaconazole sugar-phosphatepLC 1 0 Posaconazole tetradecanoic acid GC 0.177413951 −0.213996667Posaconazole tetradecanoic acid nLC 0.321679695 1.028651949 Posaconazolethiamine pLC 1 0 Posaconazole threonine/homoserin nLC 0.8574015390.040335278 Posaconazole threonine/homoserin pLC 0.729627613 0.143726858Posaconazole threonine2 GC 0.57620146 0.457 Posaconazole threonine3 GC0.770574923 0.225 Posaconazole thymine nLC 1 0 Posaconazole thymine pLC1 0 Posaconazole tms glutamine3 GC 0.344885321 0.670945158 Posaconazoletms lysine4 GC 0.950359185 0.056666667 Posaconazole TMS mevalonic acidGC 0.544918115 −0.31177059 lactone Posaconazole tms tyrosine2 GC0.380871273 1.228333333 Posaconazole tms tyrosine3 GC 0.7443728210.191063688 Posaconazole tryptophan nLC 0.300215958 4.67620651Posaconazole tryptophan pLC 1 0 Posaconazole tyrosine nLC 0.8925023440.028199192 Posaconazole tyrosine pLC 0.766496027 0.100772162Posaconazole uracil nLC 0.969596144 −0.27793622 Posaconazole uric acidpLC 1 0 Posaconazole uridine nLC 0.29522738 0.028550296 Posaconazoleurocanic acid nLC 1 0 Posaconazole urocanic acid pLC 1 0 Posaconazolevaline GC 0.599781403 0.345 Posaconazole valine nLC 0.8689835930.019504668 Posaconazole xanthosine(diH2O) pLC 1 0 PosaconazolexanthosineDiH2O nLC 1 0 Posaconazole zeatin nLC 1 0 Posaconazole zeatinpLC 1 0

[0388] The four antifungal drugs examined in the present study, Amphotericin B, Ketoconazole, Fluconazole, and Posaconazole, are known to have different effects when applied therapeutically. They are also quite different structurally, as is shown in FIG. 18, so it is not clear which characteristics are responsible for their differences. Therefore, it is desirable to determine how the compounds differentially interact within living cells, including the cells of pathogens and the cells of patients. The present experiment is designed to address these questions by examining which pathways in yeast cells (a pathogen) are affected by the four antifungal compounds. Current state-of-the-art limitations dictate that experiments examining different biological entities (DNA, RNA, protein, metabolites, phenotype) be designed and performed in individual technologies, or be designed and performed simultaneously or sequentially using different technologies, with disparate results then compared indirectly and analyzed. The present invention provides methods for obtaining integrated data from different technologies so that direct comparison and analysis are possible, enabling use of the most informative data from as many different biological sources or technologies as a biologist elects to integrate. The methods set forth in the present invention lead to complex data sets, which hold vast amounts of data. Various specific examples of the present invention are provided. The herbicide site of action study presented in Specific Example 2 (SOA1) provides a coherent data set obtained from three different biological sources via integrated technologies, with the data combined for greatest gain of biological information. The herbicide mode of action study presented in Specific Example 3 (MOA1) provides a coherent data set obtained from three different biological sources via integrated technologies, with the data combined for greatest gain of biological information. MOA1 additionally provides for the use of a fourth technology, nutritional profiling, for use in guiding the analyses of the results from gene expression, metabolite, and phenotypic technologies. The antifungal study addressed in Specific Example 5, hereinafter AF1, presents an integrated data set for the identification of biochemical pathways associated with the effects of the drugs in question. A full analysis of the AF1 data set requires linkage of data to the affected biochemical pathways, so that the observed effects of each drug on both pathogen and patient are understood.

[0389] In AF1, two different technologies were utilized: gene expression analysis (for examination of mRNA expression) and metabolite analysis. More than 6300 genes were measured by gene expression and more than 600 chemical components were measured by LC-MS and GC-MS. As noted previously, existing metabolic databases may be helpful in practicing the methods and systems of the present invention, but many databases include limitations that make their use in data analysis and pathway mapping less than straightforward. In the case of AF1, use of the KEGG database to map gene information to pathways resulted in the mapping of 1145 significantly changed genes to a total of 103 pathways. A caveat limiting the reliance on the mapping data is that KEGG mapping is not unique (one gene does not necessarily map to a single pathway), and 45% of the genes mapped to more than one pathway, as shown in FIG. 19. This caveat to KEGG makes it difficult to pinpoint the correct pathway when attempting to link a gene to a specific pathway.

[0390] Since KEGG provides multiple pathway linkages for some genes (FIG. 19) and some compounds (FIG. 20), with seven compounds mapping to more than 10 pathways (Table 13), the invention provides a method for assigning pathway scores when mapping genes and compounds to pathways.

TABLE 13 Compounds Linked to More than 10 Pathways

KEGG ID   Compound              # Pathways
C00009    phosphate             40
C00025    L-glutamate           30
C00026    2-ketoglutaric acid   27
C00049    L-aspartic acid       20
C00065    L-serine              12
C00078    L-tryptophan          11
C00109    2-ketobutyric acid    12

[0391] The pathway score indicates how meaningful the mapping is, or how likely it is to be correctly indicative of the pathway involved in the perturbation under examination. The method provides a pathway score based on perturbation levels of genes and/or compounds and the information content of each gene and/or compound in the pathway, i.e., the pathway score reflects the extent to which a gene or compound also maps to other pathways. For example, imagine that two genes are perturbed in a particular experiment. One gene maps to only one pathway, giving a high degree of probability that the perturbed pathway is the one identified in the mapping. The second gene maps to three pathways. In the latter example, there is only one-third the probability that the pathway identified in the mapping is the one perturbed. The present invention provides a method for calculating the pathway scores, so that more weight is given to a score of a gene or compound that maps to only one pathway than to a score of a gene or compound that maps to multiple pathways. Equation 1, a simplified example of this sort of calculation that does not take into account the degree of perturbation, follows:

${path\_score} = \frac{\sum\limits_{i = 1}^{j} \frac{1}{i_{path\_count}}}{n}$

[0392] Where n = the total number of genes in the pathway; $i_{path\_count}$ = the number of pathways containing gene i; and j = the number of genes in the pathway that are perturbed. Another factor to be considered when weighting a pathway score is the degree of perturbation. Degree of perturbation can be calculated, for example, based on a number of standard deviations from a norm, and included in an equation such as the one shown above, so that the score takes into account not only the number of pathways, but also the amount of gene transcript or compound present as compared to a control.
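The calculation in Equation 1 can be sketched in a few lines of Python. The first function follows Equation 1 directly. The second is only one possible way to fold in the degree of perturbation: it assumes each term is multiplied by the absolute number of standard deviations from the control mean, a weighting the text does not specify. All function names and input values are hypothetical.

```python
from statistics import mean, stdev


def path_score(n_genes, path_counts):
    """Equation 1: sum 1/path_count over the perturbed genes, divided by n."""
    return sum(1.0 / count for count in path_counts) / n_genes


def z_score(value, control_values):
    """Degree of perturbation as standard deviations from the control mean."""
    return (value - mean(control_values)) / stdev(control_values)


def weighted_path_score(n_genes, perturbed):
    """Assumed variant: each term is scaled by |z| (not specified in the text).

    perturbed is a list of (path_count, z) pairs, one per perturbed gene.
    """
    return sum(abs(z) / count for count, z in perturbed) / n_genes


# Hypothetical pathway of 4 genes; two are perturbed, mapping to 1 and 3 pathways.
simple = path_score(4, [1, 3])             # (1/1 + 1/3) / 4

controls = [10.0, 12.0, 11.0, 9.0, 13.0]   # made-up control measurements
z_a = z_score(17.0, controls)              # gene unique to this pathway
z_b = z_score(8.0, controls)               # gene shared by three pathways
weighted = weighted_path_score(4, [(1, z_a), (3, z_b)])
print(simple, weighted)
```

In both versions a gene that maps to a single pathway contributes its full weight, while a gene shared by three pathways contributes one third of it.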

[0393] Compounds were also linked to pathways using the KEGG database. KEGG links 676 compounds measured in AF1 to a total of 92 pathways. Of the 676 compounds under consideration, 77 were detected in the AF1 samples. The 77 compounds map to 69 pathways, with approximately 68% of the compounds mapping to more than one pathway, as illustrated in FIG. 20. The multiple mapping feature of KEGG makes it difficult to pinpoint the correct pathway when trying to link a compound to a specific pathway. At least seven of the compounds mapped to more than 10 pathways, rendering the maps difficult to interpret (Table 13). A pathway score calculation is applied to the compounds to account for both information content (number of pathways a compound maps to) and perturbation level.

[0394] The above describes a mapping approach to link the total data set from the four antifungal drugs to a biochemical pathway or pathways which were perturbed under the experimental conditions applied. Due to inherent limitations of the KEGG database, the approach does not provide enough information for a complete analysis of the AF1 data. Therefore, the data from the four individual drug compounds were examined. As shown in Table 14, Amphotericin B affects a much larger number of transcripts and compounds in the yeast cells than do any of the other three compounds.

TABLE 14 Number of Transcripts and Compounds Perturbed by Treatment

                     # Transcripts          # Compounds
Chemical Treatment   P < 0.1    P < 0.05    P < 0.1    P < 0.05
Amphotericin B       4652       4363        21         16
Ketoconazole         2026       1551        15         8
Fluconazole          1719       1411        6          2
Posaconazole         925        690         4          3

[0395] This observation suggests that the site(s) of action associated with Amphotericin B are likely to be more widespread throughout the yeast cells, rather than focused specifically on one or a few (possibly related) pathways. The other three drugs appear to have significantly fewer effects, indicating that their modes of action are probably less far-reaching throughout the cellular processes of the yeast (and possibly also less far-reaching for a patient receiving the compound as a drug therapy). Examination of both the transcript data and the compound data presented in Table 14 leads to the conclusion that Amphotericin B affects many more yeast cellular pathways than do Ketoconazole, Fluconazole, and Posaconazole, and that therefore, the effects of Ketoconazole, Fluconazole, and Posaconazole are far more pathway-specific than those of Amphotericin B.

[0396] The methods of the current invention require that data from different biological sources/technologies be considered together as one data set in order to get the most biologically accurate and representative information. An examination of the AF1 gene expression data alone gives a different impression than that obtained above when both the gene expression and the metabolite data were considered. As shown in FIG. 21, gene expression analysis indicates that Posaconazole has the most specific effect on the cell, and therefore might be the compound least likely to have toxic side effects. Although the present experiment only examined yeast cells, and not human cells, it can be inferred that a compound affecting more biochemical pathways in a yeast cell might also be likely to affect more pathways in a human cell. Moreover, an experiment including human cells is straightforward to conduct, and is a logical follow-up to the AF1 study described herein. Examination of the AF1 gene expression data alone, as shown in FIG. 21, indicates that Posaconazole might be the compound of choice for safely treating patients. When the gene expression data were classified into pathway mappings, as shown in Table 15, Posaconazole appears to have the most specific effect, although these data indicate that Ketoconazole and Fluconazole also have much more specific effects than Amphotericin B.

TABLE 15 Number of Pathways Affected by at Least One Gene

Chemical Treatment   # Pathways (p < 0.05)
Amphotericin B       97
Ketoconazole         90
Fluconazole          79
Posaconazole         69

[0397] However, pathway analysis of the gene expression data shows that in all of the treatments, including the three azoles and Amphotericin B, pathways related to cell proliferation are up-regulated (data taken from FIG. 21, in which the genes most perturbed were identified and linked to pathways).

[0398] Inclusion of the metabolite data provides an improved analysis and supports the usefulness of the methods of the present invention. Based on the compound results shown in Table 14 (at P < 0.05, three compounds perturbed by Posaconazole versus two by Fluconazole), Posaconazole is less specific in its effect than is Fluconazole. Analysis of these data alone leads to the conclusion that Fluconazole is the most specific acting of the four antifungal drugs studied in AF1, and is therefore probably the drug of choice for safely treating patients.

[0399] The data were then combined to determine the number of reactions showing an enzyme and at least one compound perturbed, and to determine the number of pathways having at least one enzyme and one compound perturbed. The results of the analysis are represented in Tables 16 and 17, and were difficult to interpret, illustrating that the ability to draw conclusions from compound mapping to pathways is limited absent additional data. Analysis of these data does not lead to the conclusion that Fluconazole is the most specific acting of the four antifungal drugs studied in AF1, but rather indicates that Posaconazole is the drug with the most specific effect.

TABLE 16 Number of Reactions Having an Enzyme and at Least One Compound Perturbed

Chemical Treatment   # Reactions
Amphotericin B       54
Ketoconazole         21
Fluconazole          2
Posaconazole         0

[0400] TABLE 17 Number of Pathways Having at Least One Enzyme and One Compound Perturbed

Chemical Treatment   # Pathways
Amphotericin B       37
Ketoconazole         24
Fluconazole          15
Posaconazole         3
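The two counts reported in Tables 16 and 17 can be illustrated with a short Python sketch. It assumes a hypothetical list of reactions, each annotated with its pathway, its enzyme-encoding gene, and its compounds; the identifiers and the perturbed sets are invented for illustration and are not AF1 data.

```python
def count_hits(reactions, perturbed_genes, perturbed_compounds):
    """Return (# reactions, # pathways) with both an enzyme and a compound perturbed."""
    def has_compound(reaction):
        return any(c in perturbed_compounds for c in reaction["compounds"])

    # A reaction counts only if its own enzyme AND one of its compounds are perturbed.
    hit_reactions = [r for r in reactions
                     if r["enzyme"] in perturbed_genes and has_compound(r)]

    # A pathway counts if any of its enzymes and any of its compounds are perturbed,
    # even when they belong to different reactions of that pathway.
    with_enzyme = {r["pathway"] for r in reactions if r["enzyme"] in perturbed_genes}
    with_compound = {r["pathway"] for r in reactions if has_compound(r)}
    return len(hit_reactions), len(with_enzyme & with_compound)


# Hypothetical annotations (not taken from KEGG or from the AF1 data set).
reactions = [
    {"pathway": "P1", "enzyme": "g1", "compounds": ["c1", "c2"]},
    {"pathway": "P1", "enzyme": "g2", "compounds": ["c3"]},
    {"pathway": "P2", "enzyme": "g3", "compounds": ["c4"]},
    {"pathway": "P3", "enzyme": "g4", "compounds": ["c5"]},
]
n_reactions, n_pathways = count_hits(reactions, {"g1", "g4"}, {"c3", "c5"})
print(n_reactions, n_pathways)
```

Because the pathway-level count is less strict than the reaction-level count, a treatment can register pathways even when few or no individual reactions qualify.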

[0401] A coherent data set was created from data obtained from the four above-described drug compounds. The data were reduced by using principal components analysis and cluster analysis. As shown in FIG. 22, the three azole drugs cluster quite tightly together, indicating that their modes of action are more similar to each other than to the mode of action of Amphotericin B. The observed clustering is in direct contrast to the gene expression data, which showed by pathway analysis that in all of the treatments, including the three azoles and Amphotericin B, pathways related to cell proliferation are up-regulated (data taken from FIG. 21, in which the genes most perturbed were identified and linked to pathways).
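As a rough illustration of the clustering step only, the following Python sketch groups treatment profiles by single-linkage agglomerative clustering on Euclidean distance. This is a stand-in technique chosen for brevity, not the analysis used to produce FIG. 22, and the four-value profiles are invented numbers rather than AF1 measurements.

```python
from itertools import combinations
from math import dist


def single_linkage(profiles, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [{name} for name in profiles]

    def gap(a, b):
        # Single linkage: distance between the closest members of two clusters.
        return min(dist(profiles[x], profiles[y]) for x in a for y in b)

    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: gap(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a | b)
    return clusters


# Made-up profiles in a common unit system (e.g. deviations from control).
profiles = {
    "amphotericin B": [3.1, -2.4, 2.8, 1.9],
    "ketoconazole": [0.4, 1.8, -0.2, 0.3],
    "fluconazole": [0.3, 1.6, -0.1, 0.2],
    "posaconazole": [0.2, 1.7, 0.0, 0.1],
}
groups = single_linkage(profiles, 2)
print(groups)
```

With these illustrative numbers the three azole profiles merge into one cluster and the remaining profile stays on its own, which is the qualitative pattern the paragraph describes.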

[0402] A different analysis examined which compounds were perturbed in each of the four treatments. Specifically, the analysis showed that squalene and lanosterol (plus a few unknown peaks) increased in the azole compound-treated cells, but not in the Amphotericin B-treated cells (see FIG. 23 for information directed to the pathway). This observation leads to the conclusion that the azole compounds are affecting the ergosterol pathway, a conclusion unsupported by gene expression data alone, which instead implicated cell proliferation pathways.

[0403] The AF1 example serves to support the methods and systems of the present invention by illustrating how the use of data from a single technology source provides, at best, a skewed image of biological reality. Reliance on a skewed conclusion may lead to deleterious effects, such as the administration of potentially dangerous and harmful compounds to patients. The AF1 example also serves to illustrate the problems present in the current state of the art when linking gene and metabolite data to specific biochemical pathways. It is invaluable to link metabolite data, gene expression data, annotation, phenotype data, or any other type of information to a specific pathway, and ultimately, to a disease state. As illustrated in FIG. 1, one way to obtain a data set that is meaningful and relevant to a biological system is to examine DNA, RNA, protein, metabolites, and phenotype, so that a comprehensive picture of the biological status of an organism is obtained. The present invention provides methods and systems for creating coherent data sets, which are biologically relevant and meaningful, and which can serve as models of biological systems.

SPECIFIC EXAMPLE 6 Mouse Fibroblast Azole Drug Experiment

[0404] As noted above in Specific Example 5, ergosterol is an essential component of fungal plasma membranes; it affects membrane permeability and the activities of membrane-bound enzymes. In the present example, the methods of the invention are applied in an integrated genomic and metabolomic approach to reveal the mode of action of antifungal drugs. Using cultured mouse fibroblasts (L929 cells) as a model system, the global metabolic consequences of treatment with four antifungal drugs (amphotericin B, ketoconazole, fluconazole, and posaconazole) are examined at both the transcriptome (RNA) and metabolome (small molecule) levels. The integrative analyses present a global view of the metabolic changes associated with each drug treatment, thus allowing for a better interpretation of the mode of action of antifungal drugs.

[0405] Materials and Methods

[0406] Strains and Media

[0407] L929 murine fibroblast cells were purchased from ATCC (Catalog No. CCL-1). The L929 cell line is grown under standard conditions suggested by ATCC guidelines (ATCC, Manassas, Va.). Cells are seeded in 75 cm² tissue culture flasks at a concentration that would yield 2.5-3.0×10⁶ cells at treatment time. The cells are grown in DMEM:F12 (Sigma Chemical Co., St. Louis, Mo.) supplemented with 1% L-Glutamine and 10% fetal bovine serum at 37° C., 4.9% CO₂ and 95% humidity for at least 36 hours before treatment. The media is removed from the flasks and media with the chosen concentration of drug chemical is added to the flasks. At the designated time point, the cells are harvested by centrifugation following treatment with trypsin to release the cells. The pellet is washed three times in Hanks' Balanced Salts Solution (HBSS, Sigma Chemical Co., St. Louis, Mo.). Finally, the cells are resuspended in a small volume of HBSS and transferred into 2 ml tubes. The samples are centrifuged and the wash removed. Cell pellets are flash frozen in liquid nitrogen and stored at −80° C.

[0408] Determination of MIC

[0409] Antifungal drugs Amphotericin B, ketoconazole, and fluconazole were purchased from Sigma (Sigma Chemical Co., St. Louis, Mo.), and posaconazole was a gift from Duke Medical Center (Duke University, Durham, N.C.). The minimal inhibitory concentration (MIC) is determined using 96-well plates seeded at a concentration of 20,0000 cells/well and grown in DMEM:F12 (D6559, Sigma Chemical Co., St. Louis, Mo.) supplemented with 1% L-Glutamine and 10% FBS for 25 hours at 37° C., 4.9% CO₂ and 95% humidity. The cells are treated with each fungicide in a two-fold dilution series with a maximum concentration of 200 μg/ml. Each plate contains L929 cells treated with 25 ng and 50 ng TNFα and cells grown in media only, 0.5% DMSO, and 1% DMSO. Cell viability is determined by quantifying the amount of ATP in metabolically active cells using the CELLTITER-GLO Luminescent Cell Viability Assay (Promega Corp., Madison, Wis.). At the 24 hour time point, the media is removed from the wells, the cells are washed with PBS, and PBS is added to the wells. Promega's protocol for using the CELLTITER-GLO reagent is followed and the luminescence is measured on the Tecan Ultra luminometer (Tecan Systems, Inc., San Jose, Calif.).
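The arithmetic behind the two-fold dilution series can be sketched as follows. The number of dilution steps, the luminescence readings, and the viability cutoff used to call a concentration inhibitory are all assumptions made for illustration; the text states only the 200 μg/ml maximum and the two-fold spacing.

```python
def twofold_series(max_conc, steps):
    """Concentrations of a two-fold dilution series, highest first."""
    return [max_conc / 2 ** i for i in range(steps)]


def lowest_inhibitory(concs, signals, control_signal, cutoff=0.1):
    """Lowest concentration whose luminescence is <= cutoff x control, else None.

    The 10% cutoff is a hypothetical threshold, not a value from the text.
    """
    hits = [c for c, s in zip(concs, signals) if s / control_signal <= cutoff]
    return min(hits) if hits else None


series = twofold_series(200.0, 8)   # 200, 100, 50, ... ug/ml (8 steps assumed)

# Made-up luminescence readings, one per concentration, and a vehicle control.
signals = [50, 60, 80, 400, 900, 980, 1000, 1010]
mic = lowest_inhibitory(series, signals, control_signal=1000)
print(series, mic)
```

In practice the control signal would come from the media-only or DMSO wells on the same plate, so that each plate is normalized against its own untreated cells.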

[0410] RNA Extraction and Microarray Preparation

[0411] RNA is obtained from 2-10 million fresh frozen cells using methods that are well known in the art, such as a TRIZOL (GibcoBRL, Rockville, Md.) extraction method. Microarrays containing human genes, such as Agilent's (Agilent Technologies, Palo Alto, Calif.) cDNA Microarray Kit (containing over 12,000 of Incyte's Human Drug Target clones), are used for the hybridizations, according to the manufacturer's instructions.

[0412] Microarray Data Processing and Analyses

[0413] Data are analyzed using software such as Image Analysis Software (Version A.4.0.45, Agilent Technologies, Palo Alto, Calif.) and then loaded into a database appropriate for storage and further analysis, such as the Rosetta RESOLVER database (Rosetta Inpharmatics Inc., Kirkland, Wash.).

[0414] GC-MS Derivatization and Analyses

[0415] Approximately 500,000 cells are extracted in a solvent, converted to trimethylsilyl derivatives in-situ, and analyzed by gas chromatography with time of flight mass spectrometry (GC/TOF-MS). Separations are conducted using a 50% phenyl-50% methyl stationary phase, helium carrier gas, and a programmed oven temperature that ramps from a starting temperature of 50° C. to a final temperature of over 300° C. Compounds detected by GC-MS with an electron impact (EI) ion source are cataloged based on Kovats retention indices and mass-to-charge ratio (m/z) of the ions characteristic of each peak. Commercially available reference compounds were obtained from Sigma-Aldrich (Sigma Chemical Co., St. Louis, Mo.) or VWR (VWR Scientific Products, Baltimore, Md.).

[0416] LC-MS Procedures

[0417] Approximately 500,000 cells are extracted in 0.5 ml 10% aqueous methanol containing labeled internal standards. Tissue is disrupted by a 30 second pulse of high level sonic energy (lithotripsy), at a maximum temperature of 30° C. The extract is centrifuged at 4000 rpm for 2 minutes. The supernatant, diluted with an equal volume of 50% aqueous acetonitrile (V/V), is chromatographed on C18 HPLC in an acetonitrile/water gradient containing 5 mM ammonium acetate. Samples are passed through a splitter and the split flow is infused to the turbo-ionspray ionization sources of two Mariner LC TOF mass spectrometers (PerSeptive Biosystems Inc., Framingham, Mass.). The sources are optimized to generate and monitor positive and negative ions, respectively. The Total Ion Chromatogram (TIC) is analyzed for compounds with masses ranging from 80 to 900 Da. Individual ion traces are used for both calibration and quantification. Relative amounts of the compounds are determined using the intensity and peak areas of individual ion traces. Isotopically labeled internal standards are used for peak area ratios, response factor determination, and normalization of data throughout the experiment.
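A minimal Python sketch of internal-standard normalization is given below. It assumes that each compound's peak area is divided by the peak area of a labeled internal standard measured in the same run, and that treated samples are then compared with controls as a ratio of those normalized values. The function names and peak areas are hypothetical, and response factor determination is not modeled.

```python
def area_ratio(analyte_area, standard_area):
    """Peak area of a compound relative to the labeled internal standard."""
    if standard_area <= 0:
        raise ValueError("internal standard peak area must be positive")
    return analyte_area / standard_area


def relative_amount(treated, control):
    """Normalized treated signal divided by normalized control signal.

    Each argument is an (analyte_area, standard_area) pair from one run.
    """
    return area_ratio(*treated) / area_ratio(*control)


# Made-up peak areas: the standard recovers differently in the two runs,
# so raw areas alone would understate the change.
treated_run = (30000.0, 8000.0)    # analyte, internal standard
control_run = (20000.0, 10000.0)
fold = relative_amount(treated_run, control_run)
print(fold)
```

Dividing by the internal standard first removes run-to-run differences in extraction and ionization efficiency before samples are compared.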

[0418] Data Analysis

[0419] The data are analyzed according to the methods and systems of the current invention. The data from each sample are assigned a unique identifier, and are collected and stored in a computer tracking system, wherein the data are linked to the appropriate unique identifier. All linked data are converted to a numeric format, and the numeric data are converted to a common unit system, wherein the common unit system data are a coherent data set and can serve as a model for a biological system. Additionally, the coherent data set can be compared to a reference population to determine the most informative results from the experiment, so that a signature profile is established with the most informative results.
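The sequence of steps in this paragraph can be sketched in Python under several stated assumptions: the tracking system is represented by a plain dictionary keyed by sample identifier, the common unit system is taken to be deviation from a reference population expressed in standard deviations, and the most informative results are taken to be features exceeding a cutoff of two standard deviations. The identifiers, feature names, values, and cutoff are all hypothetical.

```python
from statistics import mean, stdev


def to_common_units(sample, reference):
    """Convert raw values to standard deviations from the reference population."""
    common = {}
    for feature, raw in sample.items():
        ref = reference[feature]
        common[feature] = (float(raw) - mean(ref)) / stdev(ref)  # numeric, then unitless
    return common


def signature(common, cutoff=2.0):
    """Keep the features that deviate most from the reference (assumed cutoff)."""
    return {f: z for f, z in common.items() if abs(z) >= cutoff}


# Stand-in for the tracking system: data linked to a unique sample identifier.
tracking = {
    "S-0001": {"transcript_1": "8.0", "metabolite_1": "30.0", "metabolite_2": "5.1"},
}
# Made-up reference population values for the same features.
reference = {
    "transcript_1": [4.0, 5.0, 6.0],
    "metabolite_1": [9.0, 10.0, 11.0],
    "metabolite_2": [4.0, 5.0, 6.0],
}
coherent = {sid: to_common_units(data, reference) for sid, data in tracking.items()}
profile = signature(coherent["S-0001"])
print(profile)
```

Expressing transcript and metabolite measurements in the same unitless scale is what allows data from different technologies to sit in one coherent data set and be ranked together.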

SPECIFIC EXAMPLE 7 Human Cell Azole Drug Experiment

[0420] Strains and Media

[0421] HepG2, a human hepatocyte line, is purchased from American Type Culture Collection (ATCC, Manassas, Va.). The hepatocyte strain is grown under standard conditions as suggested by the ATCC guidelines (ATCC, Manassas, Va.). The media is removed from the flasks and media with the chosen concentration of drug chemical is added to the flasks. At the designated time point, the cells are harvested by centrifugation following treatment with trypsin to release the cells. The pellet is washed three times in Hanks' Balanced Salts Solution (HBSS, Sigma Chemical Co., St. Louis, Mo.). Finally, the cells are resuspended in a small volume of HBSS and transferred into 2 ml tubes. The samples are centrifuged and the wash removed. Cell pellets are flash frozen in liquid nitrogen and stored at −80° C.

[0422] Determination of MIC

[0423] Antifungal drugs Amphotericin B, ketoconazole, and fluconazole were purchased from Sigma (Sigma Chemical Co., St. Louis, Mo.), and posaconazole was a gift from Duke Medical Center (Duke University, Durham, N.C.). The minimal inhibitory concentration (MIC) is determined using 96-well plates seeded at a concentration of 20,0000 cells/well and grown in DMEM:F12 (D6559, Sigma Chemical Co., St. Louis, Mo.) supplemented with 1% L-Glutamine and 10% FBS for 25 hours at 37° C., 4.9% CO₂ and 95% humidity. The cells are treated with each fungicide in a two-fold dilution series with a maximum concentration of 200 μg/ml. Each plate contains HepG2 cells treated with 25 ng and 50 ng TNFα and cells grown in media only, 0.5% DMSO, and 1% DMSO. Cell viability is determined by quantifying the amount of ATP in metabolically active cells using the CELLTITER-GLO Luminescent Cell Viability Assay (Promega Corp., Madison, Wis.). At the 24 hour time point, the media is removed from the wells, the cells are washed with PBS, and PBS is added to the wells. Promega's protocol for using the CELLTITER-GLO reagent is followed and the luminescence is measured on the Tecan Ultra luminometer (Tecan Systems, Inc., San Jose, Calif.).

[0424] RNA Extraction and Microarray Preparation

[0425] RNA is obtained from 2-10 million fresh frozen cells using methods that are well known in the art, such as a TRIZOL (GibcoBRL, Rockville, Md.) extraction method. Microarrays containing human genes, such as Agilent's (Agilent Technologies, Palo Alto, Calif.) cDNA Microarray Kit (containing over 12,000 of Incyte's Human Drug Target clones), are used for the hybridizations, according to the manufacturer's instructions.

[0426] Microarray Data Processing and Analyses

[0427] Data are analyzed using software such as Image Analysis Software (Version A.4.0.45, Agilent Technologies, Palo Alto, Calif.) and then loaded into a database appropriate for storage and further analysis, such as the Rosetta RESOLVER database (Rosetta Inpharmatics Inc., Kirkland, Wash.).

[0428] GC-MS Derivatization and Analyses

[0429] Approximately 500,000 cells are extracted in a solvent, converted to trimethylsilyl derivatives in-situ, and analyzed by gas chromatography with time of flight mass spectrometry (GC/TOF-MS). Separations are conducted using a 50% phenyl-50% methyl stationary phase, helium carrier gas, and a programmed oven temperature that ramps from a starting temperature of 50° C. to a final temperature of over 300° C. Compounds detected by GC-MS with an electron impact (EI) ion source are cataloged based on Kovats retention indices and mass-to-charge ratio (m/z) of the ions characteristic of each peak. Commercially available reference compounds were obtained from Sigma-Aldrich (Sigma Chemical Co., St. Louis, Mo.) or VWR (VWR Scientific Products, Baltimore, Md.).

[0430] LC-MS Procedures

[0431] Approximately 500,000 cells are extracted in 0.5 ml 10% aqueous methanol containing labeled internal standards. Tissue is disrupted by a 30 second pulse of high level sonic energy (lithotripsy), at a maximum temperature of 30° C. The extract is centrifuged at 4000 rpm for 2 minutes. The supernatant, diluted with an equal volume of 50% aqueous acetonitrile (V/V), is chromatographed on C18 HPLC in an acetonitrile/water gradient containing 5 mM ammonium acetate. Samples are passed through a splitter and the split flow is infused to the turbo-ionspray ionization sources of two Mariner LC TOF mass spectrometers (PerSeptive Biosystems Inc., Framingham, Mass.). The sources are optimized to generate and monitor positive and negative ions, respectively. The Total Ion Chromatogram (TIC) is analyzed for compounds with masses ranging from 80 to 900 Da. Individual ion traces are used for both calibration and quantification. Relative amounts of the compounds are determined using the intensity and peak areas of individual ion traces. Isotopically labeled internal standards are used for peak area ratios, response factor determination, and normalization of data throughout the experiment.

[0432] Data Analysis

[0433] The data are analyzed according to the methods and systems of the current invention. The data from each sample are assigned a unique identifier, and are collected and stored in a computer tracking system, wherein the data are linked to the appropriate unique identifier. All linked data are converted to a numeric format, and the numeric data are converted to a common unit system, wherein the common unit system data are a coherent data set and can serve as a model for a biological system. Additionally, the coherent data set can be compared to a reference population to determine the most informative results from the experiment, so that a signature profile is established with the most informative results.

[0434] Further, the data from this experiment, Specific Example 7, are combined with the data from Specific Example 5, for an analysis and comparison of the effects of the four antifungal drugs on both the pathogen (the yeast cells in Specific Example 5) and the host (the human cells in Specific Example 7). These types of analyses promise great utility in the pharmaceutical arena, by streamlining the search for drug compounds most harmful to the pathogen and most efficacious for the patient/host.

[0435] Although the invention has been described with respect to a preferred embodiment thereof, it is also to be understood that it is not to be so limited, since changes and modifications can be made therein which are within the full intended scope of this invention as defined by the appended claims.

We claim:
 1. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system data from said biological sample, wherein said data are linked to said unique identifier; c) converting said linked data to a numeric format; d) converting said numeric format data to a common unit system, wherein said common unit system data is a coherent data set; and e) determining the most informative of said common unit system data; wherein said most informative data are a signature profile indicative of physiological status.
 2. The method according to claim 1, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 3. The method according to claim 1, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 4. The method according to claim 3, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 5. The method according to claim 3, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 6. The method according to claim 3, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 7. The method according to claim 1, wherein said signature profile is indicative of a particular disease or disease stage.
 8. The method according to claim 1, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 9. The method according to claim 1, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 10. The method according to claim 1, wherein the biological sample is from an organism having received an environmental or chemical insult.
 11. The method according to claim 1, wherein the common unit system is deviation from a standard.
 12. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system data from said biological sample, wherein said data are linked to said unique identifier; c) converting said linked data to a numeric format; d) transforming said numeric format data into a Gaussian distribution; e) converting said Gaussian distribution data to a common unit system; f) reducing the dimensionality of said common unit system data, wherein said dimensionally reduced common unit system data is a coherent data set; and g) determining the most informative of said dimensionally reduced common unit system data; wherein said most informative data are a signature profile indicative of physiological status.
 13. The method according to claim 12, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 14. The method according to claim 12, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 15. The method according to claim 14, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 16. The method according to claim 14, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 17. The method according to claim 14, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 18. The method according to claim 12, wherein said signature profile is indicative of a particular disease or disease stage.
 19. The method according to claim 12, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 20. The method according to claim 12, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 21. The method according to claim 12, wherein the biological sample is from an organism having received an environmental or chemical insult.
 22. The method according to claim 12, wherein the common unit system is deviation from a standard.
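By way of illustration only, one possible reading of "deviation from a standard" is a z-score computed against a reference (control) group. The sketch below assumes that reading; the function name `deviation_from_standard` and the sample values are hypothetical and are not taken from the specification.

```python
from statistics import mean, stdev


def deviation_from_standard(values, reference):
    """Express each value in units of reference standard deviations.

    Assumption: the "standard" is the mean of a reference (control) group,
    and the unit is that group's sample standard deviation.
    """
    mu = mean(reference)
    sigma = stdev(reference)
    if sigma == 0:
        raise ValueError("reference group has zero variance")
    return [(v - mu) / sigma for v in values]


# Hypothetical measurements for one variable.
control = [10.0, 12.0, 14.0]   # reference samples
treated = [12.0, 16.0]         # samples to be expressed in common units
scores = deviation_from_standard(treated, control)
print(scores)
```

Because every variable is expressed in the same dimensionless unit, values that originated on different measurement scales become directly comparable under this assumption.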
 23. The method according to claim 12, wherein said reduction of dimensionality is achieved by applying one of the group consisting of principal components analysis, correlation analysis, regression analysis, and pre-clustering of said common unit system data.
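As a non-limiting sketch of the correlation analysis option, the following example drops any variable that is almost perfectly correlated with a variable already retained, so that fewer variables carry the same information. The threshold of 0.95, the function names, and the data are assumptions made for illustration; the claim does not prescribe a particular procedure.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant variable is treated as uncorrelated
    return cov / sqrt(vx * vy)


def drop_correlated(columns, threshold=0.95):
    """Keep a variable only if |r| with every kept variable is below threshold."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) < threshold for other in kept.values()):
            kept[name] = values
    return kept


# Hypothetical common-unit data: one list per variable, one entry per sample.
data = {
    "gene_a": [0.1, 0.5, 0.9, 1.4],
    "gene_b": [0.2, 1.0, 1.8, 2.8],     # proportional to gene_a, so redundant
    "metab_c": [1.0, -1.0, 1.0, -1.0],  # weakly correlated with gene_a
}
reduced = drop_correlated(data)
print(list(reduced))
```

The greedy order (first variable seen is the one retained) is a simplification; principal components analysis or pre-clustering would be alternative routes to the same goal.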
 24. The method according to claim 12, wherein said transformation into a Gaussian distribution occurs by conversion of said numeric format data to a logarithmic scale.
 25. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system data from said biological sample, wherein said data are linked to said unique identifier; c) converting said linked data to a numeric format; d) transforming said numeric format data into a Gaussian distribution; e) converting said Gaussian distribution data to a common unit system, wherein said common unit system data is a coherent data set; and f) determining the most informative of said common unit system data; wherein said most informative data are a signature profile indicative of physiological status.
 26. The method according to claim 25, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 27. The method according to claim 25, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 28. The method according to claim 27, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 29. The method according to claim 27, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 30. The method according to claim 27, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 31. The method according to claim 25, wherein said signature profile is indicative of a particular disease or disease stage.
 32. The method according to claim 25, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 33. The method according to claim 25, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 34. The method according to claim 25, wherein the biological sample is from an organism having received an environmental or chemical insult.
 35. The method according to claim 25, wherein the common unit system is deviation from a standard.
 36. The method according to claim 25, wherein said transformation into a Gaussian distribution occurs by conversion of said numeric format data to a logarithmic scale.
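For illustration, conversion to a logarithmic scale can be sketched as follows. Right-skewed measurements such as signal intensities are often closer to a Gaussian distribution after a log transform, although this is not guaranteed for any particular data set. The choice of base 2 and the function name `log_transform` are assumptions; the claim does not specify a base.

```python
import math


def log_transform(values):
    """Convert strictly positive numeric data to a base-2 logarithmic scale.

    Assumption: base 2 is used purely as an example; any base gives the
    same distribution shape up to a constant scale factor.
    """
    if any(v <= 0 for v in values):
        raise ValueError("log transform requires strictly positive values")
    return [math.log2(v) for v in values]


# Hypothetical raw intensities spanning several orders of magnitude.
raw = [1.0, 2.0, 4.0, 64.0]
logged = log_transform(raw)
print(logged)
```

Zero or negative values would need separate handling (for example an offset), which this sketch deliberately rejects rather than guesses at.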
 37. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system data from said biological sample, wherein said data are linked to said unique identifier; c) converting said linked data to a numeric format; d) converting said numeric format data to a common unit system; e) reducing the dimensionality of said common unit system data, wherein said dimensionally reduced data is a coherent data set; and f) determining the most informative of said dimensionally reduced data; wherein said most informative data are a signature profile indicative of physiological status.
 38. The method according to claim 37, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 39. The method according to claim 37, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 40. The method according to claim 39, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 41. The method according to claim 39, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 42. The method according to claim 39, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 43. The method according to claim 37, wherein said signature profile is indicative of a particular disease or disease stage.
 44. The method according to claim 37, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 45. The method according to claim 37, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 46. The method according to claim 37, wherein the biological sample is from an organism having received an environmental or chemical insult.
 47. The method according to claim 37, wherein the common unit system is deviation from a standard.
 48. The method according to claim 37, wherein said reduction of dimensionality is achieved by applying one of the group consisting of principal components analysis, correlation analysis, regression analysis, and pre-clustering of said common unit system data.
 49. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system disparate data, wherein said disparate data comprise at least two types of data and said disparate data are linked to said unique identifier; c) converting said linked disparate data to a numeric format; d) converting said numeric format data to a common unit system, wherein said common unit system data is a coherent data set; and e) determining the most informative of said common unit system data; wherein said most informative data are a signature profile indicative of physiological status.
 50. The method according to claim 49, wherein said at least two types of data are obtained from the group consisting of RNA data, DNA data, protein data, metabolite data, and phenotypic data.
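The following minimal sketch illustrates, under stated assumptions, how disparate data types linked to a sample identifier might be brought into a common unit system and ranked. It assumes that the common unit is a z-score against control samples and that "most informative" means largest absolute deviation; the record layout, the function name `signature`, the sample identifiers, and all values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev


def signature(records, controls, test_id, top_k=2):
    """Rank variables of any data type by |z-score| of the test sample."""
    # Group values by (data type, variable), keyed by unique sample identifier.
    by_variable = defaultdict(dict)
    for sample_id, data_type, variable, value in records:
        by_variable[(data_type, variable)][sample_id] = value

    scores = {}
    for key, per_sample in by_variable.items():
        ref = [per_sample[s] for s in controls if s in per_sample]
        if test_id not in per_sample or len(ref) < 2:
            continue  # not enough data to define a standard
        sigma = stdev(ref)
        if sigma == 0:
            continue  # a constant reference carries no scale
        scores[key] = (per_sample[test_id] - mean(ref)) / sigma

    ranked = sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]


# Hypothetical records: (sample identifier, data type, variable, numeric value).
records = [
    ("S1", "RNA", "gene_x", 10.0), ("S2", "RNA", "gene_x", 12.0),
    ("S3", "RNA", "gene_x", 14.0), ("S4", "RNA", "gene_x", 13.0),
    ("S1", "metabolite", "glucose", 5.0), ("S2", "metabolite", "glucose", 5.5),
    ("S3", "metabolite", "glucose", 6.0), ("S4", "metabolite", "glucose", 8.0),
    ("S1", "protein", "p1", 100.0), ("S2", "protein", "p1", 110.0),
    ("S3", "protein", "p1", 120.0), ("S4", "protein", "p1", 80.0),
]
profile = signature(records, controls=["S1", "S2", "S3"], test_id="S4")
print(profile)
```

Once RNA, metabolite, and protein values share one unit, a single ranking can span all data types, which is the point of the coherent data set in this sketch.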
 51. The method according to claim 49, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 52. The method according to claim 49, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 53. The method according to claim 52, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 54. The method according to claim 52, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 55. The method according to claim 52, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 56. The method according to claim 49, wherein said signature profile is indicative of a particular disease or disease stage.
 57. The method according to claim 49, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 58. The method according to claim 49, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 59. The method according to claim 49, wherein the biological sample is from an organism having received an environmental or chemical insult.
 60. The method according to claim 49, wherein the common unit system is deviation from a standard.
 61. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system disparate data, wherein said disparate data comprise at least two types of data and said disparate data are linked to said unique identifier; c) converting said linked disparate data to a numeric format; d) transforming said numeric format data into a Gaussian distribution; e) converting said Gaussian distribution data to a common unit system; f) reducing the dimensionality of said common unit system data, wherein said dimensionally reduced data is a coherent data set; and g) determining the most informative of said dimensionally reduced data; wherein said most informative data are a signature profile indicative of physiological status.
 62. The method according to claim 61, wherein said at least two types of data are obtained from the group consisting of RNA data, DNA data, protein data, metabolite data, and phenotypic data.
 63. The method according to claim 61, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 64. The method according to claim 61, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 65. The method according to claim 64, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 66. The method according to claim 64, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 67. The method according to claim 64, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 68. The method according to claim 61, wherein said signature profile is indicative of a particular disease or disease stage.
 69. The method according to claim 61, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 70. The method according to claim 61, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 71. The method according to claim 61, wherein the biological sample is from an organism having received an environmental or chemical insult.
 72. The method according to claim 61, wherein the common unit system is deviation from a standard.
 73. The method according to claim 61, wherein said reduction of dimensionality is achieved by applying one of the group consisting of principal components analysis, correlation analysis, regression analysis, and pre-clustering of said common unit system data.
 74. The method according to claim 61, wherein said transformation into a Gaussian distribution occurs by conversion of said numeric format data to a logarithmic scale.
 75. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system disparate data, wherein said disparate data comprise at least two types of data and said disparate data are linked to said unique identifier; c) converting said linked disparate data to a numeric format; d) converting said numeric format data to a common unit system; e) reducing the dimensionality of said common unit system data, wherein said dimensionally reduced data is a coherent data set; and f) determining the most informative of said dimensionally reduced data; wherein said most informative data are a signature profile indicative of physiological status.
 76. The method according to claim 75, wherein said at least two types of data are obtained from the group consisting of RNA data, DNA data, protein data, metabolite data, and phenotypic data.
 77. The method according to claim 75, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 78. The method according to claim 75, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 79. The method according to claim 78, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 80. The method according to claim 78, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 81. The method according to claim 78, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 82. The method according to claim 75, wherein said signature profile is indicative of a particular disease or disease stage.
 83. The method according to claim 75, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 84. The method according to claim 75, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 85. The method according to claim 75, wherein the biological sample is from an organism having received an environmental or chemical insult.
 86. The method according to claim 75, wherein the common unit system is deviation from a standard.
 87. The method according to claim 75, wherein said reduction of dimensionality is achieved by applying one of the group consisting of principal components analysis, correlation analysis, regression analysis, and pre-clustering of said common unit system data.
 88. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system disparate data, wherein said disparate data comprise at least two types of data and said disparate data are linked to said unique identifier; c) converting said linked disparate data to a numeric format; d) transforming said numeric format data into a Gaussian distribution; e) converting said Gaussian distribution data to a common unit system, wherein said common unit system data is a coherent data set; and f) determining the most informative of said common unit system data; wherein said most informative data are a signature profile indicative of physiological status.
 89. The method according to claim 88, wherein said at least two types of data are obtained from the group consisting of RNA data, DNA data, protein data, metabolite data, and phenotypic data.
 90. The method according to claim 88, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 91. The method according to claim 88, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 92. The method according to claim 91, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 93. The method according to claim 91, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 94. The method according to claim 91, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 95. The method according to claim 88, wherein said signature profile is indicative of a particular disease or disease stage.
 96. The method according to claim 88, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 97. The method according to claim 88, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 98. The method according to claim 88, wherein the biological sample is from an organism having received an environmental or chemical insult.
 99. The method according to claim 88, wherein the common unit system is deviation from a standard.
 100. The method according to claim 88, wherein said transformation into a Gaussian distribution occurs by conversion of said numeric format data to a logarithmic scale.
 101. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system disparate data, wherein said disparate data comprise at least three types of data and said disparate data are linked to said unique identifier; c) converting said linked disparate data to a numeric format; d) converting said numeric format data to a common unit system, wherein said common unit system data is a coherent data set; and e) determining the most informative of said common unit system data; wherein said most informative data are a signature profile indicative of physiological status.
 102. The method according to claim 101, wherein said at least three types of data are obtained from the group consisting of RNA data, DNA data, protein data, metabolite data, and phenotypic data.
 103. The method according to claim 101, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 104. The method according to claim 101, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 105. The method according to claim 104, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 106. The method according to claim 104, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 107. The method according to claim 104, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 108. The method according to claim 101, wherein said signature profile is indicative of a particular disease or disease stage.
 109. The method according to claim 101, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 110. The method according to claim 101, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 111. The method according to claim 101, wherein the biological sample is from an organism having received an environmental or chemical insult.
 112. The method according to claim 101, wherein the common unit system is deviation from a standard.
 113. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system disparate data, wherein said disparate data comprise at least three types of data and said disparate data are linked to said unique identifier; c) converting said linked disparate data to a numeric format; d) transforming said numeric format data into a Gaussian distribution; e) converting said Gaussian distribution data to a common unit system; f) reducing the dimensionality of said common unit system data, wherein said dimensionally reduced data is a coherent data set; and g) determining the most informative of said dimensionally reduced data; wherein said most informative data are a signature profile indicative of physiological status.
 114. The method according to claim 113, wherein said at least three types of data are obtained from the group consisting of RNA data, DNA data, protein data, metabolite data, and phenotypic data.
 115. The method according to claim 113, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 116. The method according to claim 113, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 117. The method according to claim 116, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 118. The method according to claim 116, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 119. The method according to claim 116, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 120. The method according to claim 113, wherein said signature profile is indicative of a particular disease or disease stage.
 121. The method according to claim 113, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 122. The method according to claim 113, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 123. The method according to claim 113, wherein the biological sample is from an organism having received an environmental or chemical insult.
 124. The method according to claim 113, wherein the common unit system is deviation from a standard.
 125. The method according to claim 113, wherein said reduction of dimensionality is achieved by applying one of the group consisting of principal components analysis, correlation analysis, regression analysis, and pre-clustering of said common unit system data.
 126. The method according to claim 113, wherein said transformation into a Gaussian distribution occurs by conversion of said numeric format data to a logarithmic scale.
 127. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system disparate data, wherein said disparate data comprise at least three types of data and said disparate data are linked to said unique identifier; c) converting said linked disparate data to a numeric format; d) converting said numeric format data to a common unit system; e) reducing the dimensionality of said common unit system data, wherein said dimensionally reduced common unit system data is a coherent data set; and f) determining the most informative of said dimensionally reduced data; wherein said most informative data are a signature profile indicative of physiological status.
 128. The method according to claim 127, wherein said at least three types of data are obtained from the group consisting of RNA data, DNA data, protein data, metabolite data, and phenotypic data.
 129. The method according to claim 127, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 130. The method according to claim 127, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 131. The method according to claim 130, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 132. The method according to claim 130, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 133. The method according to claim 130, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 134. The method according to claim 127, wherein said signature profile is indicative of a particular disease or disease stage.
 135. The method according to claim 127, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 136. The method according to claim 127, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 137. The method according to claim 127, wherein the biological sample is from an organism having received an environmental or chemical insult.
 138. The method according to claim 127, wherein the common unit system is deviation from a standard.
 139. The method according to claim 127, wherein said reduction of dimensionality is achieved by applying one of the group consisting of principal components analysis, correlation analysis, regression analysis, and pre-clustering of said common unit system data.
 140. A method for establishing a signature profile indicative of the physiological status of an individual, comprising: a) entering a unique identifier of at least one biological sample into a computer tracking system; b) storing in said computer tracking system disparate data, wherein said disparate data comprise at least three types of data and said disparate data are linked to said unique identifier; c) converting said linked disparate data to a numeric format; d) transforming said numeric format data into a Gaussian distribution; e) converting said Gaussian distribution data to a common unit system, wherein said common unit system data is a coherent data set; and f) determining the most informative of said common unit system data; wherein said most informative data are a signature profile indicative of physiological status.
 141. The method according to claim 140, wherein said at least three types of data are obtained from the group consisting of RNA data, DNA data, protein data, metabolite data, and phenotypic data.
 142. The method according to claim 140, wherein the computer tracking system is a Laboratory Information Management System (LIMS).
 143. The method according to claim 140, wherein the biological sample is selected from the group consisting of animalia, plantae, protista, monera, and fungi.
 144. The method according to claim 143, wherein the biological sample is selected from the group consisting of human primate, non-human primate, canine, feline, equine, bovine, porcine, rabbit, rodent, liver tissue, liver spheroids, primary hepatocytes, liver cell lines, and HepG2 cells.
 145. The method according to claim 143, wherein the biological sample is selected from the group consisting of Arabidopsis, corn, wheat, barley, rye, legumes, mint, tobacco, tomatoes, rice, spinach, and peas.
 146. The method according to claim 143, wherein the biological sample is selected from the group consisting of Magnaporthe, Candida, Mycosphaerella, Botrytis, Saccharomyces, Aspergillus, Puccinia, Erysiphe, Ustilago, Fusarium, Phytophthora and Penicillium.
 147. The method according to claim 140, wherein said signature profile is indicative of a particular disease or disease stage.
 148. The method according to claim 140, wherein said signature profile is indicative of the efficacy of a therapeutic program or exposure to a particular chemical.
 149. The method according to claim 140, wherein the biological sample is selected from the group consisting of a healthy organism, a diseased organism, a drug-treated organism, and a genetically altered organism.
 150. The method according to claim 140, wherein the biological sample is from an organism having received an environmental or chemical insult.
 151. The method according to claim 140, wherein the common unit system is deviation from a standard.
 152. The method according to claim 140, wherein said transformation into a Gaussian distribution occurs by conversion of said numeric format data to a logarithmic scale.